# general
e
This is probably a dumb question, but here goes: I have a `pex_binary` target that, when run, reads some files from the local file system. I can't count on those files being present on the local file system, so I'd like to always run a script to populate those files before the `pex_binary` is run with `pants run :the_pex_binary`. I have a feeling this may be somewhat antithetical to Pants' design, or out of scope, but wanted to check.
I've been looking at `run_shell_command` and `adhoc_tool` but they don't seem quite right. For one, if I define:
```python
run_shell_command(
    name="script",
    command="echo hello"
)
```
and then make it a dependency of the `pex_binary`, it doesn't seem to run.
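For context (a hedged sketch, not a confirmed fix): per the Pants docs, `run_shell_command` only executes when invoked directly with `pants run`, and it produces no captured outputs, so depending on it from another target has no effect. `shell_command`, by contrast, runs in a sandbox and its declared outputs can flow to dependents. A minimal BUILD sketch contrasting the two (all names and the `fetch.sh` script are hypothetical):

```python
# Runs in a sandbox; declared outputs are captured and can be consumed
# by dependents (e.g. wrapped as resources for a pex_binary).
shell_command(
    name="populate-files",
    command="./fetch.sh",              # hypothetical downloader script
    output_files=["data/models.yaml"], # hypothetical output to capture
)

# Runs in the workspace, but ONLY when invoked directly:
#   pants run :populate-files-live
# Depending on it from a pex_binary does not trigger it.
run_shell_command(
    name="populate-files-live",
    command="./fetch.sh",
)
```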
w
This feels like something maybe the new workspace environment can help with eventually. But, if I’m reading this right, you want a side-effect to run before your pex_binary. Will the pex_binary contain the newly created files?
e
> you want a side-effect to run before your pex_binary.
that's right. The binary won't contain the newly created files. At least, it doesn't need to. But if that's a way to do this, I'm game to try it.
w
Well, I guess, if it's not necessarily in the pex - why not have a shim inside your code?
e
yeah I'm trying to avoid code changes since the context is a migration from a previous build system. But I could do that. what if I go the route of putting the files in the PEX? It's sort of unclear to me how those files are accessible through general file system APIs.
w
I mean, there might be a way to use the shell script or another thing, but it feels like cramming determinism into a non-deterministic system (or vice versa, I'm not sure). Like: I want reproducible builds, but I conditionally want to change some part of the filesystem, which, after I make my build - might then change again ... before my next build? It's a bit abstract, so I'm trying to make it more concrete. The shim inside a pex file assumes that you want these files created at runtime (which is what it sounded like), but it's also running through Pants. Not suggesting it can't be done, but I haven't done it using Pants. But, using either a macro or a plugin, the world's your oyster
I guess for me, what you're trying to achieve specifically might be more useful - so we're not XYing this problem
e
yeah, fair. So we have a previous build in Earthly. It's got some tasks like "run service X". The service, on startup, loads some BentoML models from the filesystem into memory. So in a dependent Earthly task, we made sure those BentoML models were present in the expected place by reading a YAML file and downloading them using the BentoML CLI (which fetches them from GCS). Then the service would run, see the files, and load them properly. Previously this was all happening in a Docker container, which I could probably do again, but am trying to avoid this time around.
I do have control over where we look for the BentoML model files. So if I can include them in the PEX and point the code at that spot, that's fine.
w
Ah interesting, any reason to move off Earthly? I've never used it. Re question: just need to think for a sec
e
many reasons:
• everything running in Docker is incredibly slow (you're always rebuilding images)
• it's so general as to be useless - you need to manually define all your tasks like linting/formatting/tests/run/etc. It's not really aware of much besides Docker.
• you still need something like Poetry to manage your deps, and once we hit about 8 pyproject.tomls in the repo it just became too much (it's a monorepo, but a small one)
(thanks for humouring me on this weird question, I appreciate it)
ā¤ļø 1
w
Oh wow, okay, I guess I really didn't know much about Earthly - I thought it was more than just Docker. Cool, thanks for letting me know! So, me personally, just speaking for myself - this feels like the kinda thing I would end up doing in Docker, or in pre-run bindings in a `scie`: https://github.com/a-scie/jump
I'm pretty sure this would all be do-able in a plugin. Natively though, I'm a little less sure. You've tried shell, which didn't work out (https://www.pantsbuild.org/2.21/docs/shell#testing-your-packaging-pipeline), and adhoc tool would be my next place to look. I'm curious if `archive` might be an option
As far as I know, files and resources are expected to be in place: https://www.pantsbuild.org/2.21/docs/using-pants/assets-and-archives
But yeah, Shell really feels like it would have been the move https://www.pantsbuild.org/2.21/reference/targets/shell_command
Adhoc tool (https://www.pantsbuild.org/2.21/reference/targets/adhoc_tool), being a slightly improved approach
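A hedged sketch of what that adhoc tool route might look like (all target names are hypothetical; `runnable`, `execution_dependencies`, and `output_directories` are real fields per the linked reference):

```python
# Hypothetical adhoc_tool pipeline: run a downloader target in a
# sandbox and capture what it writes back into the build graph.
adhoc_tool(
    name="fetch-models",
    runnable=":downloader",               # e.g. a python_source target
    execution_dependencies=[":manifest"], # extra files the tool reads
    output_directories=["models"],        # captured from the sandbox
)
```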
e
ah ok haven't looked at that (archive) yet, thanks for the pointer. It could be that I'm just doing shell wrong too. Docker does seem like the easy route, but I think the difficulty I would have is running the tests. They are integration tests that basically start the service and throw requests at it. So it does all this file loading then too.
w
Yep, fair, I personally avoid docker when I can - as there's a burden in using it, that I often don't need.
This was the thing I've been struggling to find as well: https://www.pantsbuild.org/2.21/reference/build-file-symbols/http_source Not sure if there is any value?
e
hmm what does it do exactly?
w
As I understand it, it would allow using a remote URL as an HTTP source, pulling that into the pipeline - but not re-downloading if it exists. So, not quite reading a yaml file, but 🤷
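A hedged sketch of what that looks like in a BUILD file (the URL and digest are placeholders; `http_source` requires the exact byte length and sha256 so the download is reproducible and cacheable):

```python
# Hypothetical: materialize a remote file as a `file` target.
file(
    name="model-weights",
    source=http_source(
        url="https://example.com/models/weights.bin",  # placeholder URL
        len=123456,            # exact byte length of the remote file
        sha256="deadbeef...",  # placeholder hex digest, verified on fetch
    ),
)
```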
I'm curious about this though, what you've described feels possible - but, as I've never done it - I'm just less knowledgeable as a result. I might give it a shot tonight when I'm home
Also, if you want a practical `adhoc_tool` example, this is one of me building a sveltekit project, so you can see the steps - all of which are technically side-effects, but some of those effects are pulled back into the sandboxes https://gist.github.com/sureshjoshi/98fb09f2a340f7c1dad270c4887865a0
e
I see. This archive/assets docs page is helping a lot actually. I've been thinking about this wrong, not surprisingly.
> A `file` target is for loose files that are copied into the chroot where Pants runs your code. You can then load these files through direct mechanisms like Python's `open()` or Java's `FileInputStream`
this sounds like what I want. I can basically put the BentoML model files into the repo dir in the expected structure, and have the code load from there. and as far as I understand it, `shell_command` essentially gives you `files`, is that right? So I could then use that to run the BentoML CLI to create the `files` instead of having to put them in manually.
thanks for that adhoc link, I may need it before long 😅
w
> and as far as I understand it, `shell_command` essentially gives you `files`, is that right? So I could then use that to run the BentoML CLI to create the `files` instead of having to put them in manually.
I don't know about BentoML - but if there is some way to have loose files in your repo, using files/resources is how you would collect them into the system
I'll be back this evening to see if I can get that side-effect concept working. I'm super curious now
e
hah ok thanks, really appreciate your help. I'm going to plug away at it some more, will keep you posted.
w
So, this ended up being a pretty cool little side track, using some targets I'd never needed to touch before
https://github.com/sureshjoshi/pants-shell-command-example I wrote an example with a few different ways to access local files. Easiest being a local resource, then downloading a resource (via `http_source`), and then the closest I could think to make an easy example of what you want to do. adhoc_tool might be a cleaner way to do it, with more reliable caching, but I used a shell command to call a script to read from a manifest.txt file, and then download an image, and stick it into the pex
In main, I just open each of the files and print something out from it
e
oh man this is super helpful, thanks a ton! I should be able to get something going.
w
If you have a CLI tool, then the shell command or adhoc tool are probably safe bets - and then you have to carefully pass dependencies from `output_xyz` to the next layer. That was also the first time I'd needed to use `experimental_wrap_as_resources` since otherwise, a `file` is generated, and it's kinda loose in the pex_binary target
e
yeah I think I'll need to use a `file` in the end, the code that loads the model is doing regular FS access
ok this is interesting. If I comment out the `experimental_wrap_as_resources` and have the pex depend directly on `:run-downloader`, I get this warning when running:
```
❯ pants run src:bin
22:00:53.21 [WARN] The target src:bin (`pex_binary`) transitively depends on the below `files` targets, but Pants will not include them in the built package. Filesystem APIs like `open()` may be not able to load files within the binary itself; instead, they read from the current working directory.

Instead, use `resources` targets. See <https://www.pantsbuild.org/resources>.

Files targets dependencies: ['src:run-downloader']
```
and the `run-downloader.sh` doesn't actually run! (I can tell because I added `exit 1` to the beginning). Is that expected? I get that the PEX won't contain the files, but shouldn't they still be available in the working dir?
w
That's interesting - I guess since the pex won't include it, it just doesn't bother?
Maybe have something else depend on run-downloader?
e
the warning goes away with this, but it still doesn't run. So confused...
```python
pex_binary(
    name="bin",
    dependencies=[
        ":lib",
        ":archive"
    ],
    entry_point="main.py"
)

archive(
  name="archive",
  format="zip",
  files=[":run-downloader"],
)

python_sources(
    name="lib",
    dependencies=[
        ":local-file",
        ":downloaded-image",
    ],
    sources=["**/*.py"],
)

# experimental_wrap_as_resources(
#     name="wrapped_downloader_output",
#     inputs=[":run-downloader"],
# )

shell_command(
    name="run-downloader",
    command="./downloader.sh",
    execution_dependencies=[":scripts", ":manifest"],
    output_files=["dilbert-rng.gif"], # This must match your expected file(s) (or use output_directory)
    tools=["curl", "head"],
)
```
w
and you're still running in my repo?
e
yep
if I do `pants package src:archive` it does run
w
I guess a pex can't contain an archive?
Never thought about that, never tried it
I know you can pass along a pex_binary to an archive, so maybe it's one-directional
And the reason you want it to run this way, is so that if you run `pants run src:bin` you want to ensure the shell command is run
Even if the output isn't put into the pex?
e
yeah exactly
I mean, maybe that doesn't make sense. But I was hoping when the PEX ran it would have access to the `file`, somewhere
if I turn off the `entry_point` to get a REPL, and then poke around with `os`, I'm in the source tree. So, maybe I guess I want to put the files into the source tree?
w
🤷 But as I mentioned, might be worth checking out adhoc tool and seeing if maybe that has some of the hooks you're interested in. It's a lot more powerful for making pipelines
e
fair enough, I'll keep looking. Thanks again!
seems like `adhoc_tool` has the same issues as `shell_command`. But, `run_shell_command` will just let me write files to the project tree which should probably be enough. The only sad part is I still can't make the `pex_binary` depend on the `run_shell_command` so I have to do `pants run src:the_run_shell_command; pants run src:bin`. Was hoping a `cli.alias` would help, but `pants run` only accepts one target. Would be nice if you could do `pants run target1 target2` and just have them run in sequence.