I'm trying to add a new target type that can downl...
# plugins
f
I'm trying to add a new target type that can download files and use them as
files
-like dependencies...i get how to create the target type, but what kind of object do i need to return in my rule to make it available as a dependency of other targets?
GeneratedSources
maybe?
h
Yes, codegen is appropriate for this. I’m assuming that this new target will not have actual original source files, which is fine - you can make the
Sources
subclass set
alias = "_sources"
, which causes the field to be ignored in
./pants help
. See https://github.com/pantsbuild/pants/blob/72d0409e61a58d2390390405d3a87a771feea61b/src/python/pants/core/target_types.py#L62-L186 for how we do a very similar setup with the
relocated_files
target
See https://www.pantsbuild.org/docs/rules-api-file-system#downloadfile for how to safely download files with the engine I realized it’s slightly stale, will update now
👍🏻 1
f
Awesome I got this working pretty quickly. I see how I can use addprefix to rename the file too. How about adding an action after download, like unzip or untar... Is there a shortcut to doing that that or should I just use a process?
h
Oh, if it’s an archive file, check out archive.py for pre-built functionality to extract it: https://github.com/pantsbuild/pants/blob/master/src/python/pants/core/util_rules/archive.py https://github.com/pantsbuild/pants/blob/72d0409e61a58d2390390405d3a87a771feea61b/src/python/pants/core/util_rules/external_tool.py#L242-L245 is an example of downloading a file, plus maybe extracting it. (No op if not an archive)
f
This is dope
h
hats off to @happy-kitchen-89482 for writing it all 😄 and John for the
BinaryPaths
abstraction to safely locate binaries like
tar
and
unzip
on your machine
f
i'm getting an engine exception with with the
ExtractedFile
get:
Copy code
Exception: WithDeps(Inner(InnerEntry { params: {DownloadFileRequest}, rule: Task(Task { product: GeneratedSources, can_modify_workunit: false, clause: [DownloadFileRequest], gets: [Get { output: PyDigest, input: DownloadFile }, Get { output: ExtractedArchive, input: PyDigest }, Get { output: Snapshot, input: PyDigest }], func: jumiopants.download_files:59:download_file(), cacheable: true, display_info: DisplayInfo { name: "jumiopants.download_files.download_file", desc: None, level: Trace } }) })) did not declare a dependency on JustGet(Get { output: Snapshot, input: ExtractedArchive })
h
Are you on 2.1.0.dev1? This error message was recently improved
f
no i'm still on 2.0.0
h
Okay, sg. In the meantime, this error message is when the input to
Get
is different than what’s expected. If you have
Get(Snapshot, ExtractedArchive, "not an extracted archive object")
for example
Do you have this in your code:
await Get(Snapshot, Digest, my_extracted_archive)
?
f
okay i kinda get that from what's being said
yeah i forgot how types work for a second apparently
h
okay i kinda get that from what’s being said
It’s a horrible horrible error message. 2.1.0.dev0 revamps it all
f
reminds me of haskell lol
😂 1
tons of expressive power; terrible error messages
👍 1
h
Hopefully we can graduate to Rust: tons of expressive power, fantastic error messages Big remaining kahuna is when it’s an invalid rule graph
🦍 1
f
is there a way to keep the generated file in view for a
repl
goal?
it seems to show up relative to the process dir root in
run
goals, but for repl it's buried in
.pants.d/
actually i lied...where does this show up for a
run
goal file?
h
That’s what I was confused about. Both run and repl create a tempdir in
.pants.d
, and load the contents on
PYTHONPATH
(via
PEX_EXTRA_SYS_PATH
) iiuc. We use a tmpdir so that we don’t materialize things like codegen into your build root. Unlike normally using a tmpdir in
/tmp
, we use
.pants.d
so that you can reach into files in your build root if you want, since both
run
and
repl
are never cached and we reason it’s safer to be less hermetic
So, to answer your question, I would expect the file to be somewhere like
.pants.d/tmp139131/src/python/project/asset.zip
See https://pantsbuild.slack.com/archives/C046T6T9U/p1604545674208100?thread_ts=1604262956.144200&cid=C046T6T9U about a new
write-codegen
goal I prototyped last night. It could be relevant here, but has a possible dangerous consequence if implemented the way I did I would love a) if you’d like that goal to begin with, and b) which of those two solutions sound better to you
f
i think 2 makes way more sense and is a lot simpler
h
Cool, I agree. Thanks! Then this feature should be good to go for 2.1.0.dev1 tomorrow 🙂
f
my use case is about providing a way to interact with external artifacts, like vendor-provided binaries, or ML model artifacts, but without having to commit them
👍 1
so what i need is for them to be at a predictable path when running the relevant goal
hmm is there a way to turn a
files
target into a
resources
target?
i would only want to do this for local execution targets, not for things i mean to package, but it might be a more straightforward way of handling this
h
Would it work to have your new codegen rule output
ResourcesSources
, rather than
FilesSources
?
f
I can try that
i think the issue is that i would need both, and that might be ambiguous for the rule graph
h
Taking a step back, what’s the objective with using
resources
rather than
files
? They’re very similar, only that
resources
will have the source root included in the PYTHONPATH. (https://www.pantsbuild.org/docs/resources describes this more)
f
to not have to think about loading file paths in these goals
h
So, you’d load them with something like
pkgutil.get_data()
and not need to worry about
.pants.d
, rather than using
open()
, right?
(This might reveal that
relocated_files
won’t work how we want with those two goals. I’ve never understood very well running
repl
and
run
in
.pants.d
, and I wonder if it was a mistake and we should run in a clean tmpdir like normal)
f
Let me explain a bit more...in production, we use an environment variable called
MODEL_PATH
to tell ML inference code where to find model artifacts. These models can be COPY'd into docker images, mounted in there, or loaded via other processes, depending on the deployment. Unfortunately, we don't currently have a good way of emulating this locally to experiment or test (run/repl/test goals, maybe raul's jupyter goal would be nice too).
my idea was to use a
python_library(name="model-loaded")
target that depends on both the inference code's
python_library(name="lib")
target and the
download_file(name="model-v1")
target i'm adding for this artifact
this would work so that, say,
./pants repl path/to/build:model-loaded
would download the model and insert it into some predictable path within the execution environment so that someone could load it. I think i'd actually rather this happen via
open()
than
pkgutil
, but I need some way for the execution process to have access to that information
👍 1
that could work via some kind of environment variable or just a well-known path
i guess right now i can do
Path('.pants.d').glob('**/my_file')
but that's clearly a hack
h
Yeah, I personally think this a smell that having
run
and
repl
run in the build root under a tmpdir isn’t worth it. There’s this issue you’re running into with code generated
files
, and there’s also an issue that it’s confusing to have both
.pants.d/tmp131/src/python/project/app.py
and normal
src/python/project/app.py
. From the user’s perspective, how are those different? We went with running in the build root under the idea that
repl
is more experimental, so it’s fine to break hermiticity. But I don’t know if that is worth it. And we don’t use that logic with
test --debug
cc @happy-kitchen-89482 what do you think? @hundreds-breakfast-49010 and @witty-crayon-22786 too
w
Will review shortly. But having read only Eric's last so far: anything that is intended to be edited should live in the repo, and be edited there for the purposes of live-reload usecases
h
Oh, hm, I remember a reason we had
run
be in the build root now. We wanted scripts that write to the build root to work. They wouldn’t if they ran in a tmpdir
f
this all makes sense, and i think it puts us back to having code generated files in
dist
as a solid option, as it's 1) predictable 2) very likely to be in .gitignore 3) easy for users to clear when they don't need it anymore (a good
rm -rf build dist
from time to time never hurt anyone)
👍 1
h
Yeah, I think the
write-codegen
goal is helpful for multiple reasons, and I appreciate your validation that being in
dist/
is good. But still, you pointed out
relocated_files
isn’t going to work how we want with
run
and
repl
, which I want to figure out.
w
have reading all of this now: generated code “must” be under a location like
.pants.d/tmp**
or
dist
for the purposes of `repl`/`run`, so i don’t think the CWD of repl/run are relevant here
f
for "editable codegen" things (like generating client and server stubs from specs, things like that) i think that's another story altogether and it isn't clear to me how something like that would fit into how pants operates
w
the separate goal that “publishes” code-generated stuff into
dist
feels orthogonal to the usecase, if the usecase is `repl`/`run`
f
dist
is better than
.pants.d/tmp**
IMO because i don't have to glob or search to find my code
w
@flat-zoo-31952: you shouldn’t have to regardless if you use codegen
f
i'm not really using codegen, i'm sneaking on the codegen train to add binary downloads
w
the effect of generating ResourceSources would be that they are automatically included on the PYTHONPATH when your code runs
h
you shouldn’t have to regardless if you use codegen
For things that you load via Python’s module system, agreed. You can use
import
and
pkgutil.get_data()
to use things in
.pants.d/tmp1323
without issue. But we want to be able to use
open()
instead. And it’s important imo that calling
open()
on a
relocated_files
target would work with
repl
and
run
, too. Right now, you need to prefix the path with
.pants.d/tmp123
for generated files (not resources) And the only reason you don’t need that prefix for non-generated files is that you’re reading from the actual build root, rather than the
.pants.d
tmpdir
w
@hundreds-father-404: yea… so with regard to FilesSources, how
files
are loaded for
run
is … an open question independent of codegen, right?
f
using
open()
having a real named file is actually pretty relevant in this use case because sometimes C extensions don't play nice with python's stream API
h
so with regard to FilesSources, how files are loaded for run is … an open question independent of codegen, right?
Yep. It works for non codegen, but only because we’re leaking into the actual build root. It fails for any code genned files, including
relocated_files
targets
w
right now, if you were to generate FilesSources, and then run either
test
or
package
, you would get those loose files in a well-known relative path
👍 2
but i don’t think that there is a universal story yet for
run
.
f
yes, i would like the same with repl/run
👍 2
w
makes complete sense. i don’t know if we have a ticket for that… it’s very obviously desirable for
run
because of how parallel it is to
package
… repl is more interesting, but probably still possible.
👍 1
f
i think
repl
is a really compelling usecase for anybody who works with binary data and needs a team to play with it somewhat regularly without clogging up a repo...you can have external programs, collections of files, neural network models, etc, all kinda fake version controlled by virtue of the checksum you have to store in-repo, but that don't need to be downloaded until you have the particular need to work with that target
it's the expanding/contracting monorepo thing without the filesystem virtualization
you'd just have to delete your cache every now and then 😄
w
yea, that makes sense
h
I can, if it’s only describing the issue and constraints. I have no idea what the solution might be
w
yea, that works
thanks!
h
2.1.0rc0 is now out with the new
export-codegen
goal
f
is that related to repl/run, or is that about exporting the results of code generation to filesystem for, say, inclusion in the repo?
h
is that about exporting the results of code generation to filesystem for, say, inclusion in the repo?
This, so that you can access files at
dist/codegen
. The bigger problem still exists at https://github.com/pantsbuild/pants/issues/11109. For example, the above workaround has an issue that if you forget to manually regenerate the files, your `repl`/`run` will be using stale files
w
Yea, you wouldn't want to use this for repl/run: mostly for IDE integration or other cases where an external/non-integrated tool needs to consume generated code. We still need to address repl/run.