# general
b
Another question: When writing a plugin, I want to basically do this:
import zipfile

def myrule(input: artifact....):
    # Copy every entry into a new zip, stripping the 'someprefix/' path prefix.
    with zipfile.ZipFile(input.relative_path, 'r') as inzip, zipfile.ZipFile(input.output_path, 'w') as outzip:
        for name in inzip.namelist():
            outzip.writestr(name.replace('someprefix/', ''), inzip.read(name))
I see the `Process` object as a way to execute a shell command, but in this case I can do everything I want in no-deps Python. Do I need to write the script and use `Process(['python3', 'myscript.py']....)`, or is there an easier way?
e
There is no easier way. Consider what happens when you turn on remoting: remoting doesn't ship rule code around, just args and env.
It's perhaps confusing that rule code is written in Python. It looks like you should be able to do all Python things, but you really can't. Any "real" work pretty much has to be in a Process.
h
Yeah, but remoting would still have `zipfile` from the stdlib. I don't think using `zipfile` to do processing is an issue? Rather, the issue is using `zipfile.ZipFile(path)` to do a filesystem operation. You should be using the engine for reading in a file, e.g. `Get(DigestContents, PathGlobs(...))` (https://www.pantsbuild.org/docs/rules-api-file-system). Otherwise file watching / invalidation won't work properly. That is, use the engine to get the raw bytes, then go ahead using the non-side-effecty parts of `zipfile`.
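(A minimal sketch of that shape, with hypothetical `ZipListingRequest`/`ZipMemberNames` types standing in for whatever the plugin would actually define: the engine does the read, and `zipfile` only ever sees in-memory bytes.)

import io
import zipfile
from dataclasses import dataclass
from typing import Tuple

from pants.engine.fs import DigestContents, PathGlobs
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class ZipListingRequest:  # hypothetical request type, for illustration only
    path: str


@dataclass(frozen=True)
class ZipMemberNames:  # hypothetical result type, for illustration only
    names: Tuple[str, ...]


@rule
async def list_zip_members(request: ZipListingRequest) -> ZipMemberNames:
    # The engine reads the file, so file watching / invalidation work.
    digest_contents = await Get(DigestContents, PathGlobs([request.path]))
    # zipfile only operates on in-memory bytes: no side-effecting IO here.
    with zipfile.ZipFile(io.BytesIO(digest_contents[0].content)) as zf:
        return ZipMemberNames(tuple(zf.namelist()))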
e
In this particular case you could get this to work by creating a digest with the in-mem contents of the zip you build up.
But you really don't want to do that.
The short answer is you should avoid any IO done by Python. All IO should be done by the engine.
b
I think I'm just struggling to understand the difference between "write my rule in Python" and "write the tool that does the thing my rule does" in Python, and where the line is.
e
In Python you can only manipulate data structures.
That is really all. Create a dict, iterate a thing, await a Get, etc.
You can get away with more. We parse JSON in Python to interpret Process outputs, for example. So it does get fuzzy. But the IO hard line is the one to watch: parse JSON from bytes provided by the engine? OK. Get the bytes of the JSON from disk or via an HTTP call in Python? Not OK.
👍 1
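(To put that line in code: a hedged sketch with a made-up tool and made-up request/result types, not anything Pants ships.)

import json
from dataclasses import dataclass

from pants.engine.process import Process, ProcessResult
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class ToolReportRequest:  # hypothetical request type, for illustration only
    target: str


@dataclass(frozen=True)
class ToolReport:  # hypothetical result type, for illustration only
    num_entries: int


@rule
async def interpret_tool_output(request: ToolReportRequest) -> ToolReport:
    # The engine runs the tool and hands its stdout back as bytes.
    result = await Get(
        ProcessResult,
        Process(
            argv=("/usr/bin/my-tool", "--json", request.target),  # made-up tool
            description=f"Run my-tool on {request.target}",
        ),
    )
    # Parsing engine-provided bytes is pure data manipulation: OK.
    # json.load(open(...)) or an HTTP request here would be untracked IO: not OK.
    return ToolReport(num_entries=len(json.loads(result.stdout)))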
h
"In Python you can only manipulate data structures."
I don't think that's true. There are plenty of examples of doing processing in-Python, like using regex for dependency inference. The key is you can't do IO-unsafe things without the engine, like reading the file system or making network requests. Otherwise, it's a choice between the benefits of doing it in-memory vs. via `Process`, where a major benefit of using `Process` is that you get long-lived caching on disk, rather than only memoization from the Pants daemon.
b
Maybe I'm going about this wrong. My goal is: given an artifact zip, I want to move files around and remove ones matching a specific group of globs. The zip coming in has a structure like `/parent/child/things.{txt,json,csv}`, and I want to move them to be `things.json` at the top level, and ignore the .txt and .csv files.
e
Yeah, perhaps the most non-obvious thing about the engine is a design choice that could be different: only `Process` results are truly cached forever and can be remoted. Everything else runs every time pantsd starts up anew.
b
to me, it feels like I want a rule that takes the zip, plus a group of globs&excludes, then outputs a new zip
e
Yup.
So you can either:
1. Read in the zip from a byte buffer provided by the engine and write it out to a byte buffer you hand back to the engine.
2. Write a Process to do this.
👍 1
1 is convenient but bad for big zips.
It won't work well for a 2GB zip say.
And 2 can be cached forever and remoted in a build farm, 1 can't.
h
Here, probably easiest is to use Pants's built-in functionality to unzip for you (see `core/util_rules/archive.py`), then you'll get a `Digest`. Now, use `DigestEntries` to get a handle to each file in the exploded zip, without reading the bytes into memory (i.e. not `DigestContents`). Combine that with `CreateDigest` to get back what you want. https://www.pantsbuild.org/docs/rules-api-file-system#digestentries
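(A hedged sketch of that recipe, with hypothetical `RepackedZipRequest`/`RepackedZip` types; the extracted digest itself would come from the archive rules mentioned above. It keeps only the .json files and moves them to the top level, matching the stated goal.)

import os.path
from dataclasses import dataclass

from pants.engine.fs import CreateDigest, Digest, DigestEntries, FileEntry
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class RepackedZipRequest:  # hypothetical: a digest of the already-extracted zip contents
    extracted: Digest


@dataclass(frozen=True)
class RepackedZip:  # hypothetical result type
    digest: Digest


@rule
async def repack(request: RepackedZipRequest) -> RepackedZip:
    # Get a handle to every file without loading any bytes into memory.
    entries = await Get(DigestEntries, Digest, request.extracted)
    kept = [
        # A FileEntry points at content already in the store, so nothing is copied:
        # keep only the .json files and flatten them to the top level.
        FileEntry(os.path.basename(entry.path), entry.file_digest)
        for entry in entries
        if isinstance(entry, FileEntry) and entry.path.endswith(".json")
    ]
    new_digest = await Get(Digest, CreateDigest(kept))
    return RepackedZip(new_digest)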
e
Yeah, that's the best option here.
b
And if I want to modify the file content, is that something I could do via digests if I wanted, just not recommended for big stuff? E.g. collapsing JSON from indented to compact form. Or by doing so would I be going entirely the wrong way, and for things that read/modify file content should I use a `Process`?
h
If you want to do that in-memory rather than via a process, I think I'd recommend first making the Digest smaller by using the above instructions with `Get(Digest, CreateDigest(...))`. Then, you can use `DigestContents` to get the raw bytes of the single file, do your Python manipulation of it, and finally call `Get(Digest, CreateDigest(...))` one last time with the modified file. Downsides to doing it in-memory are:
1. If the file is very large, materializing it in memory isn't great.
2. It won't be cached to disk, only memoized in memory, so when pantsd restarts you'll redo the work.
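(A hedged sketch of that in-memory flavor, again with hypothetical request/result types; `separators=(",", ":")` is just the standard way to compact JSON with the stdlib.)

import json
from dataclasses import dataclass

from pants.engine.fs import CreateDigest, Digest, DigestContents, FileContent
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class CompactJsonRequest:  # hypothetical: a digest holding the single .json file
    digest: Digest


@dataclass(frozen=True)
class CompactJson:  # hypothetical result type
    digest: Digest


@rule
async def compact_json(request: CompactJsonRequest) -> CompactJson:
    # Materialize the (already pared-down) digest's bytes into memory.
    contents = await Get(DigestContents, Digest, request.digest)
    file_content = contents[0]
    # Pure in-memory manipulation: re-serialize the JSON without indentation.
    compact = json.dumps(json.loads(file_content.content), separators=(",", ":"))
    # Hand the modified bytes back to the engine as a new digest.
    new_digest = await Get(
        Digest, CreateDigest([FileContent(file_content.path, compact.encode())])
    )
    return CompactJson(new_digest)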
b
That makes sense. And the way I would do the `Process` version, if my code doing the manipulation were in Python, would be like `src/python/pants/backend/terraform/dependency_inference.py`, where I'd have a standalone script available in `pkg_resources` and use `VenvPex` to build that into a command I eventually run under `Process`, right?
@dataclass(frozen=True)
class ParserSetup:
    pex: VenvPex


@rule
async def setup_parser(hcl2_parser: TerraformHcl2Parser) -> ParserSetup:
    parser_script_content = pkgutil.get_data("pants.backend.terraform", "hcl2_parser.py")
in my case the files are fairly large, so having it in cache sounds really beneficial
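(Roughly, yes. As an illustrative sketch of the running side, once you have a `VenvPex` built the same way as the `ParserSetup` above: `RepackerSetup`, the request/result types, `repacked.zip`, and the argv are all made up, and the exact `VenvPexProcess` arguments can differ between Pants versions.)

from dataclasses import dataclass

from pants.backend.python.util_rules.pex import VenvPex, VenvPexProcess
from pants.engine.fs import Digest
from pants.engine.process import ProcessResult
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class RepackerSetup:  # hypothetical: a VenvPex built like ParserSetup above
    pex: VenvPex


@dataclass(frozen=True)
class RepackRequest:  # hypothetical: the files the script should operate on
    input_digest: Digest


@dataclass(frozen=True)
class RepackResult:  # hypothetical: the digest the script produced
    digest: Digest


@rule
async def run_repacker(setup: RepackerSetup, request: RepackRequest) -> RepackResult:
    # Running the script as a Process means the result is cached on disk and
    # can be remoted, which matters when the files are large.
    result = await Get(
        ProcessResult,
        VenvPexProcess(
            setup.pex,
            argv=("--out", "repacked.zip"),  # made-up arguments for the made-up script
            input_digest=request.input_digest,
            output_files=("repacked.zip",),
            description="Repack artifact zip",
        ),
    )
    return RepackResult(result.output_digest)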
f
And you could combine both approaches: manipulate a `Digest` in engine rules down to just the files that need to be processed, then invoke your Python script to do the JSON processing, then recombine the output of that `Process` into a `Digest` with whatever files it needs.
💯 1
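(The usual building block for that recombining step is `MergeDigests`; a minimal sketch with hypothetical types.)

from dataclasses import dataclass

from pants.engine.fs import Digest, MergeDigests
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class CombineRequest:  # hypothetical: digests produced by earlier steps
    kept_files: Digest
    processed_files: Digest


@dataclass(frozen=True)
class CombinedOutput:  # hypothetical result type
    digest: Digest


@rule
async def combine_outputs(request: CombineRequest) -> CombinedOutput:
    # Merging digests only assembles references to content already in the store;
    # no file bytes are copied.
    digest = await Get(Digest, MergeDigests([request.kept_files, request.processed_files]))
    return CombinedOutput(digest)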
thinking about how it should work in remote execution mode might be helpful. i.e., how to act on the data “from a distance”. even locally, engine rules should still try to “act at a distance”
b
and the savings from combining both would be that I transfer less data, right?
👍 1
f
Digest manipulation doesn't actually need to ship data, usually. If the files are already in the cache, then constructing a "new" digest is just creating one or more `Directory` protos with references to their digests.
b
and so then I would avoid transferring digests I ultimately will exclude with a regex to the worker?
since that's in-mem?
f
Could you clarify that question? I’m confused as to what stage of your proposed processing “exclude with a regex to the worker” refers to.
ah looking back in thread, you want to remove files matching globs/regex?
h
I think this might be what you're asking. Yes, by pre-processing the zip file to be minimal via `core.util_rules.archive` -> `DigestEntries` + `CreateDigest`, you will make the `Process` you use to process the single `.json` file more efficient. For example, less time to set up the sandbox (tmp directory).
b
yep that's right @fast-nail-55400, I want to remove certain files (and if I do that before the next processing step, less transfers) and do some work on others. Thanks Eric, that makes sense
e
I wouldn't get too hung up on efficiencies yet. I'm pretty sure there were glosses there. If you first unzip using Pants rules, that will turn 1 big zip file into the 10k files within, and each will be stored separately in the cache. After that, operations on remote hosts may or may not hit the cache on those individual 10k entries, so there may be some shipping around, but that is getting titchy at that point and you really won't be able to guess well what's going on in any given remoting implementation.
👍 1
IOW, unzipping using a Process (the Pants rules do this) is the big bang here - 1 zip to 10k digests. All the rest is titchy math.
h
Yeah that's some good wisdom to not prematurely optimize. You can always add the optimization later
e
And the tough thing here is there is not one optimization target. The local lmdb cache with 10k entries and your rules that act on them will have different perf characteristics than said-same with remoting backend 1, and said-same with remoting backend 2.
To wit, I think it's fair to say we're still shaking out the perf characteristics of the rules we ship wrt local vs. remoting perf, etc.