# general
b
Another question: When writing a plugin, I want to basically do this:
import zipfile

def myrule(input: artifact....):
    # Copy every entry into a new zip, stripping the 'someprefix/' path prefix.
    with zipfile.ZipFile(input.relative_path, 'r') as inzip, zipfile.ZipFile(input.output_path, 'w') as outzip:
        for name in inzip.namelist():
            outzip.writestr(name.replace('someprefix/', ''), inzip.read(name))
I see the `Process` object as a way to execute a shell command, but in this case I can do everything I want in no-deps Python. Do I need to write the script and use `Process(['python3', 'myscript.py']....)`, or is there an easier way?
e
There is no easier way. Consider what happens when you turn on remoting: remoting doesn't ship rule code around, just args and env.
It's perhaps confusing that rule code is written in Python. It looks like you should be able to do all Python things, but you really can't. Any "real" work pretty much has to be in a Process.
h
Yeah, but remoting would still have `zipfile` from the stdlib. I don't think using `zipfile` to do processing is an issue? Rather, the issue is using `zipfile.ZipFile(path)` to do a filesystem operation. You should be using the engine for reading in a file, e.g. `Get(DigestContents, PathGlobs(...))` (https://www.pantsbuild.org/docs/rules-api-file-system). Otherwise file watching / invalidation won't work properly. That is, use the engine to get the raw bytes, then go ahead using the non-side-effecty parts of `zipfile`.
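(A minimal sketch of that shape, with hypothetical `ZipListingRequest`/`ZipMemberNames` types standing in for whatever the plugin would actually define: the engine does the read, and `zipfile` only ever sees in-memory bytes.)

import io
import zipfile
from dataclasses import dataclass
from typing import Tuple

from pants.engine.fs import DigestContents, PathGlobs
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class ZipListingRequest:  # hypothetical request type, for illustration only
    path: str


@dataclass(frozen=True)
class ZipMemberNames:  # hypothetical result type, for illustration only
    names: Tuple[str, ...]


@rule
async def list_zip_members(request: ZipListingRequest) -> ZipMemberNames:
    # The engine reads the file, so file watching / invalidation work.
    digest_contents = await Get(DigestContents, PathGlobs([request.path]))
    # zipfile only operates on in-memory bytes: no side-effecting IO here.
    with zipfile.ZipFile(io.BytesIO(digest_contents[0].content)) as zf:
        return ZipMemberNames(tuple(zf.namelist()))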
e
In this particular case you could get this to work by creating a digest with the in-mem contents of the zip you build up.
But you really don't want to do that.
The short answer is you should avoid any IO done by Python. All IO should be done by the engine.
b
I think I'm just struggling to understand the difference between "write my rule in Python" and "write the tool that does the thing my rule does" in Python, and where the line is.
e
In Python you can only manipulate data structures.
That is really all. Create a dict, iterate a thing, await a Get, etc.
You can get away with more. We parse JSON in Python to interpret Process outputs, for example. So it does get fuzzy. But the IO hard line is the one to watch: parse JSON from bytes provided by the engine? OK. Get the bytes of the JSON from disk or via an HTTP call in Python? Not OK.
👍 1
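(To put that line in code: a hedged sketch with a made-up tool and made-up request/result types, not anything Pants ships.)

import json
from dataclasses import dataclass

from pants.engine.process import Process, ProcessResult
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class ToolReportRequest:  # hypothetical request type, for illustration only
    target: str


@dataclass(frozen=True)
class ToolReport:  # hypothetical result type, for illustration only
    num_entries: int


@rule
async def interpret_tool_output(request: ToolReportRequest) -> ToolReport:
    # The engine runs the tool and hands its stdout back as bytes.
    result = await Get(
        ProcessResult,
        Process(
            argv=("/usr/bin/my-tool", "--json", request.target),  # made-up tool
            description=f"Run my-tool on {request.target}",
        ),
    )
    # Parsing engine-provided bytes is pure data manipulation: OK.
    # json.load(open(...)) or an HTTP request here would be untracked IO: not OK.
    return ToolReport(num_entries=len(json.loads(result.stdout)))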
h
"In Python you can only manipulate data structures."
I don't think that's true. There are plenty of examples of doing processing in-Python, like using regex for dependency inference. The key is you can't do IO-unsafe things without the engine, like reading the file system or making network requests. Otherwise, it's a choice between the benefits of doing it in-memory vs. via `Process`, where a major benefit of using `Process` is that you get long-lived caching on disk, rather than only memoization from the Pants daemon.
b
Maybe I'm going about this wrong. My goal is: given an artifact zip, I want to move files around and remove ones matching a specific group of globs. The zip coming in has a structure like `/parent/child/things.{txt,json,csv}`, and I want to move them to be `things.json` at the top level, and ignore the .txt and .csv files.
e
Yeah, perhaps the most non-obvious thing about the engine is a design choice that could be different: only `Process` results are truly cached forever and can be remoted. Everything else runs every time pantsd starts up anew.
b
to me, it feels like I want a rule that takes the zip, plus a group of globs&excludes, then outputs a new zip
e
Yup.
So you can either:
1. Read in the zip from a byte buffer provided by the engine and write it out to a byte buffer you hand back to the engine.
2. Write a Process to do this.
👍 1
1 is convenient but bad for big zips.
It won't work well for a 2GB zip say.
And 2 can be cached forever and remoted in a build farm, 1 can't.
h
Here, probably easiest is to use Pants's built-in functionality to unzip for you (see `core/util_rules/archive.py`), then you'll get a `Digest`. Now, use `DigestEntries` to get a handle to each file in the exploded zip, without reading the bytes into memory (i.e. not `DigestContents`). Combine that with `CreateDigest` to get back what you want. https://www.pantsbuild.org/docs/rules-api-file-system#digestentries
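(A hedged sketch of that recipe, with hypothetical `RepackedZipRequest`/`RepackedZip` types; the extracted digest itself would come from the archive rules mentioned above. It keeps only the .json files and moves them to the top level, matching the stated goal.)

import os.path
from dataclasses import dataclass

from pants.engine.fs import CreateDigest, Digest, DigestEntries, FileEntry
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class RepackedZipRequest:  # hypothetical: a digest of the already-extracted zip contents
    extracted: Digest


@dataclass(frozen=True)
class RepackedZip:  # hypothetical result type
    digest: Digest


@rule
async def repack(request: RepackedZipRequest) -> RepackedZip:
    # Get a handle to every file without loading any bytes into memory.
    entries = await Get(DigestEntries, Digest, request.extracted)
    kept = [
        # A FileEntry points at content already in the store, so nothing is copied:
        # keep only the .json files and flatten them to the top level.
        FileEntry(os.path.basename(entry.path), entry.file_digest)
        for entry in entries
        if isinstance(entry, FileEntry) and entry.path.endswith(".json")
    ]
    new_digest = await Get(Digest, CreateDigest(kept))
    return RepackedZip(new_digest)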
e
Yeah, that's the best option here.
b
And if I want to modify the file content, is that something I could do via digests if I wanted, just not recommended for big stuff? E.g. collapsing JSON from indented to compact form. Or by doing so would I be going entirely the wrong way, and for things that read/modify file content should I use a `Process`?
h
If you want to do that in-memory rather than via a process, I think I'd recommend first making the Digest smaller by using the above instructions with `Get(Digest, CreateDigest(...))`. Then, you can use `DigestContents` to get the raw bytes of the single file, do your Python manipulation of it, and finally call `Get(Digest, CreateDigest(...))` one last time with the modified file. Downsides to doing it in-memory are:
1. If the file is very large, materializing it in memory isn't great.
2. It won't be cached to disk, only memoized in memory, so when pantsd restarts you'll redo the work.
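(A hedged sketch of that in-memory flavor, again with hypothetical request/result types; `separators=(",", ":")` is just the standard way to compact JSON with the stdlib.)

import json
from dataclasses import dataclass

from pants.engine.fs import CreateDigest, Digest, DigestContents, FileContent
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class CompactJsonRequest:  # hypothetical: a digest holding the single .json file
    digest: Digest


@dataclass(frozen=True)
class CompactJson:  # hypothetical result type
    digest: Digest


@rule
async def compact_json(request: CompactJsonRequest) -> CompactJson:
    # Materialize the (already pared-down) digest's bytes into memory.
    contents = await Get(DigestContents, Digest, request.digest)
    file_content = contents[0]
    # Pure in-memory manipulation: re-serialize the JSON without indentation.
    compact = json.dumps(json.loads(file_content.content), separators=(",", ":"))
    # Hand the modified bytes back to the engine as a new digest.
    new_digest = await Get(
        Digest, CreateDigest([FileContent(file_content.path, compact.encode())])
    )
    return CompactJson(new_digest)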
b
That makes sense. And the way I would do the `Process` version, if my code doing the manipulation were in Python, would be like `src/python/pants/backend/terraform/dependency_inference.py`, where I'd have a standalone script available in `pkg_resources` and use `VenvPex` to build that into a command I eventually run under `Process`, right?
@dataclass(frozen=True)
class ParserSetup:
    pex: VenvPex


@rule
async def setup_parser(hcl2_parser: TerraformHcl2Parser) -> ParserSetup:
    parser_script_content = pkgutil.get_data("pants.backend.terraform", "hcl2_parser.py")
in my case the files are fairly large, so having it in cache sounds really beneficial
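(Roughly, yes. As an illustrative sketch of the running side, once you have a `VenvPex` built the same way as the `ParserSetup` above: `RepackerSetup`, the request/result types, `repacked.zip`, and the argv are all made up, and the exact `VenvPexProcess` arguments can differ between Pants versions.)

from dataclasses import dataclass

from pants.backend.python.util_rules.pex import VenvPex, VenvPexProcess
from pants.engine.fs import Digest
from pants.engine.process import ProcessResult
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class RepackerSetup:  # hypothetical: a VenvPex built like ParserSetup above
    pex: VenvPex


@dataclass(frozen=True)
class RepackRequest:  # hypothetical: the files the script should operate on
    input_digest: Digest


@dataclass(frozen=True)
class RepackResult:  # hypothetical: the digest the script produced
    digest: Digest


@rule
async def run_repacker(setup: RepackerSetup, request: RepackRequest) -> RepackResult:
    # Running the script as a Process means the result is cached on disk and
    # can be remoted, which matters when the files are large.
    result = await Get(
        ProcessResult,
        VenvPexProcess(
            setup.pex,
            argv=("--out", "repacked.zip"),  # made-up arguments for the made-up script
            input_digest=request.input_digest,
            output_files=("repacked.zip",),
            description="Repack artifact zip",
        ),
    )
    return RepackResult(result.output_digest)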
f
And you could combine both approaches: manipulate a `Digest` in engine rules down to just the files that need to be processed, then invoke your Python script to do the JSON processing, then recombine the output of that `Process` into a `Digest` with whatever files it needs.
💯 1
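(The usual building block for that recombining step is `MergeDigests`; a minimal sketch with hypothetical types.)

from dataclasses import dataclass

from pants.engine.fs import Digest, MergeDigests
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class CombineRequest:  # hypothetical: digests produced by earlier steps
    kept_files: Digest
    processed_files: Digest


@dataclass(frozen=True)
class CombinedOutput:  # hypothetical result type
    digest: Digest


@rule
async def combine_outputs(request: CombineRequest) -> CombinedOutput:
    # Merging digests only assembles references to content already in the store;
    # no file bytes are copied.
    digest = await Get(Digest, MergeDigests([request.kept_files, request.processed_files]))
    return CombinedOutput(digest)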
thinking about how it should work in remote execution mode might be helpful. i.e., how to act on the data “from a distance”. even locally, engine rules should still try to “act at a distance”
b
and the savings from combining both would be that I transfer less data, right?
👍 1
f
Digest manipulation doesn't actually need to ship data, usually. If the files are already in the cache, then constructing a "new" digest is just creating one or more `Directory` protos with references to their digests.
b
and so then I would avoid transferring digests I ultimately will exclude with a regex to the worker?
since that's in-mem?
f
Could you clarify that question? I’m confused as to what stage of your proposed processing “exclude with a regex to the worker” refers to.
ah looking back in thread, you want to remove files matching globs/regex?
h
I think this might be what you're asking. Yes, by pre-processing the zip file to be minimal via `core.util_rules.archive` -> `DigestEntries` + `CreateDigest`, you will make the `Process` you use to process the single `.json` file more efficient. For example, less time to set up the sandbox (tmp directory).
b
yep that's right @fast-nail-55400, I want to remove certain files (and if I do that before the next processing step, less transfers) and do some work on others. Thanks Eric, that makes sense
e
I wouldn't get too hung up on efficiencies yet. I'm pretty sure there were glosses there. If you first unzip using Pants rules, that will turn 1 big zip file into the 10k files within, and each will be stored separately in the cache. After that, operations on remote hosts may or may not hit the cache on those individual 10k entries, so there may be some shipping around, but that is getting titchy at that point and you really won't be able to guess well what's going on in any given remoting implementation.
👍 1
IOW, unzipping using a Process (the Pants rules do this) is the big bang here - 1 zip to 10k digests. All the rest is titchy math.
h
Yeah that's some good wisdom to not prematurely optimize. You can always add the optimization later
e
And the tough thing here is there is not one optimization target. The local lmdb cache with 10k entries and your rules that act on them will have different perf characteristics than said-same with remoting backend 1, and said-same with remoting backend 2.
To wit, I think it's fair to say we're still shaking out the perf characteristics of the rules we ship wrt local vs. remoting perf, etc.