I m trying to add a new target type that can download files Pants #plugins

I'm trying to add a new target type that can downl...

flat-zoo-31952

11/05/2020, 4:20 PM

I'm trying to add a new target type that can download files and use them as

files

-like dependencies...i get how to create the target type, but what kind of object do i need to return in my rule to make it available as a dependency of other targets?

GeneratedSources

maybe?

hundreds-father-404

11/05/2020, 4:46 PM

Yes, codegen is appropriate for this. I’m assuming that this new target will not have actual original source files, which is fine - you can make the

Sources

subclass set

alias = "_sources"

, which causes the field to be ignored in

./pants help

. See https://github.com/pantsbuild/pants/blob/72d0409e61a58d2390390405d3a87a771feea61b/src/python/pants/core/target_types.py#L62-L186 for how we do a very similar setup with the

relocated_files

target

hundreds-father-404

11/05/2020, 4:47 PM

See https://www.pantsbuild.org/docs/rules-api-file-system#downloadfile for how to safely download files with the engine I realized it’s slightly stale, will update now

👍🏻 1

flat-zoo-31952

11/05/2020, 7:41 PM

Awesome I got this working pretty quickly. I see how I can use addprefix to rename the file too. How about adding an action after download, like unzip or untar... Is there a shortcut to doing that that or should I just use a process?

hundreds-father-404

11/05/2020, 7:43 PM

Oh, if it’s an archive file, check out archive.py for pre-built functionality to extract it: https://github.com/pantsbuild/pants/blob/master/src/python/pants/core/util_rules/archive.py https://github.com/pantsbuild/pants/blob/72d0409e61a58d2390390405d3a87a771feea61b/src/python/pants/core/util_rules/external_tool.py#L242-L245 is an example of downloading a file, plus maybe extracting it. (No op if not an archive)

flat-zoo-31952

11/05/2020, 7:46 PM

This is dope

hundreds-father-404

11/05/2020, 7:47 PM

hats off to @happy-kitchen-89482 for writing it all 😄 and John for the

BinaryPaths

abstraction to safely locate binaries like

tar

and

unzip

on your machine

flat-zoo-31952

11/05/2020, 8:17 PM

i'm getting an engine exception with with the

ExtractedFile

get:

Copy code

Exception: WithDeps(Inner(InnerEntry { params: {DownloadFileRequest}, rule: Task(Task { product: GeneratedSources, can_modify_workunit: false, clause: [DownloadFileRequest], gets: [Get { output: PyDigest, input: DownloadFile }, Get { output: ExtractedArchive, input: PyDigest }, Get { output: Snapshot, input: PyDigest }], func: jumiopants.download_files:59:download_file(), cacheable: true, display_info: DisplayInfo { name: "jumiopants.download_files.download_file", desc: None, level: Trace } }) })) did not declare a dependency on JustGet(Get { output: Snapshot, input: ExtractedArchive })

hundreds-father-404

11/05/2020, 8:18 PM

Are you on 2.1.0.dev1? This error message was recently improved

flat-zoo-31952

11/05/2020, 8:18 PM

no i'm still on 2.0.0

hundreds-father-404

11/05/2020, 8:19 PM

Okay, sg. In the meantime, this error message is when the input to

Get

is different than what’s expected. If you have

Get(Snapshot, ExtractedArchive, "not an extracted archive object")

for example

hundreds-father-404

11/05/2020, 8:20 PM

Do you have this in your code:

await Get(Snapshot, Digest, my_extracted_archive)

flat-zoo-31952

11/05/2020, 8:20 PM

okay i kinda get that from what's being said

flat-zoo-31952

11/05/2020, 8:20 PM

yeah i forgot how types work for a second apparently

hundreds-father-404

11/05/2020, 8:21 PM

okay i kinda get that from what’s being said

It’s a horrible horrible error message. 2.1.0.dev0 revamps it all

flat-zoo-31952

11/05/2020, 8:21 PM

reminds me of haskell lol

😂 1

flat-zoo-31952

11/05/2020, 8:22 PM

tons of expressive power; terrible error messages

👍 1

hundreds-father-404

11/05/2020, 8:24 PM

Hopefully we can graduate to Rust: tons of expressive power, fantastic error messages Big remaining kahuna is when it’s an invalid rule graph

🦍 1

flat-zoo-31952

11/05/2020, 8:53 PM

is there a way to keep the generated file in view for a

repl

goal?

flat-zoo-31952

11/05/2020, 8:54 PM

it seems to show up relative to the process dir root in

run

goals, but for repl it's buried in

.pants.d/

flat-zoo-31952

11/05/2020, 9:06 PM

actually i lied...where does this show up for a

run

goal file?

hundreds-father-404

11/05/2020, 9:08 PM

That’s what I was confused about. Both run and repl create a tempdir in

.pants.d

, and load the contents on

PYTHONPATH

(via

PEX_EXTRA_SYS_PATH

) iiuc. We use a tmpdir so that we don’t materialize things like codegen into your build root. Unlike normally using a tmpdir in

/tmp

, we use

.pants.d

so that you can reach into files in your build root if you want, since both

run

and

repl

are never cached and we reason it’s safer to be less hermetic

hundreds-father-404

11/05/2020, 9:08 PM

So, to answer your question, I would expect the file to be somewhere like

.pants.d/tmp139131/src/python/project/asset.zip

hundreds-father-404

11/05/2020, 9:09 PM

See https://pantsbuild.slack.com/archives/C046T6T9U/p1604545674208100?thread_ts=1604262956.144200&cid=C046T6T9U about a new

write-codegen

goal I prototyped last night. It could be relevant here, but has a possible dangerous consequence if implemented the way I did I would love a) if you’d like that goal to begin with, and b) which of those two solutions sound better to you

flat-zoo-31952

11/05/2020, 9:30 PM

i think 2 makes way more sense and is a lot simpler

hundreds-father-404

11/05/2020, 9:30 PM

Cool, I agree. Thanks! Then this feature should be good to go for 2.1.0.dev1 tomorrow 🙂

flat-zoo-31952

11/05/2020, 9:31 PM

my use case is about providing a way to interact with external artifacts, like vendor-provided binaries, or ML model artifacts, but without having to commit them

👍 1

flat-zoo-31952

11/05/2020, 9:33 PM

so what i need is for them to be at a predictable path when running the relevant goal

flat-zoo-31952

11/05/2020, 9:35 PM

hmm is there a way to turn a

files

target into a

resources

target?

flat-zoo-31952

11/05/2020, 9:35 PM

i would only want to do this for local execution targets, not for things i mean to package, but it might be a more straightforward way of handling this

hundreds-father-404

11/05/2020, 9:36 PM

Would it work to have your new codegen rule output

ResourcesSources

, rather than

FilesSources

flat-zoo-31952

11/05/2020, 9:47 PM

I can try that

flat-zoo-31952

11/05/2020, 9:48 PM

i think the issue is that i would need both, and that might be ambiguous for the rule graph

hundreds-father-404

11/05/2020, 9:50 PM

Taking a step back, what’s the objective with using

resources

rather than

files

? They’re very similar, only that

resources

will have the source root included in the PYTHONPATH. (https://www.pantsbuild.org/docs/resources describes this more)

flat-zoo-31952

11/05/2020, 9:55 PM

to not have to think about loading file paths in these goals

hundreds-father-404

11/05/2020, 9:56 PM

So, you’d load them with something like

pkgutil.get_data()

and not need to worry about

.pants.d

, rather than using

open()

, right?

hundreds-father-404

11/05/2020, 9:57 PM

(This might reveal that

relocated_files

won’t work how we want with those two goals. I’ve never understood very well running

repl

and

run

.pants.d

, and I wonder if it was a mistake and we should run in a clean tmpdir like normal)

flat-zoo-31952

11/05/2020, 9:59 PM

Let me explain a bit more...in production, we use an environment variable called

MODEL_PATH

to tell ML inference code where to find model artifacts. These models can be COPY'd into docker images, mounted in there, or loaded via other processes, depending on the deployment. Unfortunately, we don't currently have a good way of emulating this locally to experiment or test (run/repl/test goals, maybe raul's jupyter goal would be nice too).

flat-zoo-31952

11/05/2020, 10:00 PM

my idea was to use a

python_library(name="model-loaded")

target that depends on both the inference code's

python_library(name="lib")

target and the

download_file(name="model-v1")

target i'm adding for this artifact

flat-zoo-31952

11/05/2020, 10:04 PM

this would work so that, say,

./pants repl path/to/build:model-loaded

would download the model and insert it into some predictable path within the execution environment so that someone could load it. I think i'd actually rather this happen via

open()

than

pkgutil

, but I need some way for the execution process to have access to that information

👍 1

flat-zoo-31952

11/05/2020, 10:05 PM

that could work via some kind of environment variable or just a well-known path

flat-zoo-31952

11/05/2020, 10:07 PM

i guess right now i can do

Path('.pants.d').glob('**/my_file')

but that's clearly a hack

hundreds-father-404

11/05/2020, 10:11 PM

Yeah, I personally think this a smell that having

run

and

repl

run in the build root under a tmpdir isn’t worth it. There’s this issue you’re running into with code generated

files

, and there’s also an issue that it’s confusing to have both

.pants.d/tmp131/src/python/project/app.py

and normal

src/python/project/app.py

. From the user’s perspective, how are those different? We went with running in the build root under the idea that

repl

is more experimental, so it’s fine to break hermiticity. But I don’t know if that is worth it. And we don’t use that logic with

test --debug

cc @happy-kitchen-89482 what do you think? @hundreds-breakfast-49010 and @witty-crayon-22786 too

witty-crayon-22786

11/05/2020, 10:16 PM

Will review shortly. But having read only Eric's last so far: anything that is intended to be edited should live in the repo, and be edited there for the purposes of live-reload usecases

hundreds-father-404

11/05/2020, 10:17 PM

Oh, hm, I remember a reason we had

run

be in the build root now. We wanted scripts that write to the build root to work. They wouldn’t if they ran in a tmpdir

flat-zoo-31952

11/05/2020, 10:24 PM

this all makes sense, and i think it puts us back to having code generated files in

dist

as a solid option, as it's 1) predictable 2) very likely to be in .gitignore 3) easy for users to clear when they don't need it anymore (a good

rm -rf build dist

from time to time never hurt anyone)

👍 1

hundreds-father-404

11/05/2020, 10:26 PM

Yeah, I think the

write-codegen

goal is helpful for multiple reasons, and I appreciate your validation that being in

dist/

is good. But still, you pointed out

relocated_files

isn’t going to work how we want with

run

and

repl

, which I want to figure out.

witty-crayon-22786

11/05/2020, 10:27 PM

have reading all of this now: generated code “must” be under a location like

.pants.d/tmp**

dist

for the purposes of `repl`/`run`, so i don’t think the CWD of repl/run are relevant here

flat-zoo-31952

11/05/2020, 10:27 PM

for "editable codegen" things (like generating client and server stubs from specs, things like that) i think that's another story altogether and it isn't clear to me how something like that would fit into how pants operates

witty-crayon-22786

11/05/2020, 10:27 PM

the separate goal that “publishes” code-generated stuff into

dist

feels orthogonal to the usecase, if the usecase is `repl`/`run`

flat-zoo-31952

11/05/2020, 10:28 PM

dist

is better than

.pants.d/tmp**

IMO because i don't have to glob or search to find my code

witty-crayon-22786

11/05/2020, 10:28 PM

@flat-zoo-31952: you shouldn’t have to regardless if you use codegen

flat-zoo-31952

11/05/2020, 10:29 PM

i'm not really using codegen, i'm sneaking on the codegen train to add binary downloads

witty-crayon-22786

11/05/2020, 10:29 PM

the effect of generating ResourceSources would be that they are automatically included on the PYTHONPATH when your code runs

hundreds-father-404

11/05/2020, 10:30 PM

you shouldn’t have to regardless if you use codegen

For things that you load via Python’s module system, agreed. You can use

import

and

pkgutil.get_data()

to use things in

.pants.d/tmp1323

without issue. But we want to be able to use

open()

instead. And it’s important imo that calling

open()

on a

relocated_files

target would work with

repl

and

run

, too. Right now, you need to prefix the path with

.pants.d/tmp123

for generated files (not resources) And the only reason you don’t need that prefix for non-generated files is that you’re reading from the actual build root, rather than the

.pants.d

tmpdir

witty-crayon-22786

11/05/2020, 10:31 PM

@hundreds-father-404: yea… so with regard to FilesSources, how

files

are loaded for

run

is … an open question independent of codegen, right?

flat-zoo-31952

11/05/2020, 10:31 PM

using
open()
having a real named file is actually pretty relevant in this use case because sometimes C extensions don't play nice with python's stream API

hundreds-father-404

11/05/2020, 10:32 PM

so with regard to FilesSources, how files are loaded for run is … an open question independent of codegen, right?

Yep. It works for non codegen, but only because we’re leaking into the actual build root. It fails for any code genned files, including

relocated_files

targets

witty-crayon-22786

11/05/2020, 10:32 PM

right now, if you were to generate FilesSources, and then run either

test

package

, you would get those loose files in a well-known relative path

👍 2

witty-crayon-22786

11/05/2020, 10:32 PM

but i don’t think that there is a universal story yet for

run

flat-zoo-31952

11/05/2020, 10:32 PM

yes, i would like the same with repl/run

👍 2

witty-crayon-22786

11/05/2020, 10:33 PM

makes complete sense. i don’t know if we have a ticket for that… it’s very obviously desirable for

run

because of how parallel it is to

package

… repl is more interesting, but probably still possible.

👍 1

flat-zoo-31952

11/05/2020, 10:38 PM

i think

repl

is a really compelling usecase for anybody who works with binary data and needs a team to play with it somewhat regularly without clogging up a repo...you can have external programs, collections of files, neural network models, etc, all kinda fake version controlled by virtue of the checksum you have to store in-repo, but that don't need to be downloaded until you have the particular need to work with that target

flat-zoo-31952

11/05/2020, 10:39 PM

it's the expanding/contracting monorepo thing without the filesystem virtualization

flat-zoo-31952

11/05/2020, 10:39 PM

you'd just have to delete your cache every now and then 😄

witty-crayon-22786

11/05/2020, 10:57 PM

yea, that makes sense

witty-crayon-22786

11/05/2020, 11:03 PM

@hundreds-father-404: do you want to create this one, or should i ? https://pantsbuild.slack.com/archives/C01CQHVDMMW/p1604615614024800?thread_ts=1604593256.009300&cid=C01CQHVDMMW

hundreds-father-404

11/05/2020, 11:04 PM

I can, if it’s only describing the issue and constraints. I have no idea what the solution might be

witty-crayon-22786

11/05/2020, 11:46 PM

yea, that works

witty-crayon-22786

11/05/2020, 11:47 PM

thanks!

hundreds-father-404

11/06/2020, 10:38 PM

https://github.com/pantsbuild/pants/issues/11109

👍 1

hundreds-father-404

11/10/2020, 3:00 AM

2.1.0rc0 is now out with the new

export-codegen

goal

flat-zoo-31952

11/10/2020, 3:04 AM

is that related to repl/run, or is that about exporting the results of code generation to filesystem for, say, inclusion in the repo?

hundreds-father-404

11/10/2020, 3:07 AM

is that about exporting the results of code generation to filesystem for, say, inclusion in the repo?

This, so that you can access files at

dist/codegen

. The bigger problem still exists at https://github.com/pantsbuild/pants/issues/11109. For example, the above workaround has an issue that if you forget to manually regenerate the files, your `repl`/`run` will be using stale files

witty-crayon-22786

11/10/2020, 3:38 AM

Yea, you wouldn't want to use this for repl/run: mostly for IDE integration or other cases where an external/non-integrated tool needs to consume generated code. We still need to address repl/run.

2 Views

Open in Slack

Previous Next