better-sunset-63499 (02/15/2022, 6:16 PM):
```python
import zipfile

def myrule(input: artifact....):
    with zipfile.ZipFile(input.relative_path, 'r') as inzip, \
         zipfile.ZipFile(input.output_path, 'w') as outzip:
        for f in inzip.infolist():
            outzip.writestr(f.filename.replace('someprefix/', ''), inzip.read(f))
```
I see the Process object as a way to execute a shell command, but in this case I can do everything I want in no-deps Python. Do I need to write the script and use Process(['python3', 'myscript.py']....), or is there an easier way?
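For reference, the prefix-stripping the snippet above gestures at can be done purely in memory with the std-lib zipfile module. A runnable sketch (the buffer-based helper and the file names are illustrative, not from the thread):

```python
import io
import zipfile

def strip_prefix(zip_buf: io.BytesIO, prefix: str) -> io.BytesIO:
    """Rewrite a zip held in memory, dropping `prefix` from each member name."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf) as inzip, zipfile.ZipFile(out_buf, "w") as outzip:
        for info in inzip.infolist():
            name = info.filename
            if name.startswith(prefix):
                name = name[len(prefix):]
            # Copy the member's bytes under its new name.
            outzip.writestr(name, inzip.read(info))
    return out_buf
```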
hundreds-father-404 (02/15/2022, 6:21 PM):
zipfile is from the std-lib; I don't think using zipfile to do processing is an issue? Rather, the issue is using zipfile.open() to do a filesystem operation. You should be using the engine for reading in a file, e.g. Get(DigestContents, PathGlobs(...)): https://www.pantsbuild.org/docs/rules-api-file-system. Otherwise file watching / invalidation won't work properly. That is, use the engine to get the raw bytes, then go ahead and use the non-side-effecting parts of zipfile.
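Sketched as pseudocode in the shape of a Pants @rule (the request/result types and the archive path are illustrative, not from the thread):

```
@rule
async def process_archive(request: MyRequest) -> MyResult:
    # The engine reads the file, so watching/invalidation keep working.
    contents = await Get(DigestContents, PathGlobs(["path/to/archive.zip"]))
    raw = contents[0].content  # bytes, now safely in memory
    # Non-side-effecting zipfile use is fine from here, e.g.
    # zipfile.ZipFile(io.BytesIO(raw)).
    ...
```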
hundreds-father-404 (02/15/2022, 6:25 PM):
> In python you can only manipulate datastructures.

I don't think that's true. There are plenty of examples of doing processing in-Python, like using regex for dependency inference. The key is that you can't do IO-unsafe things without the engine, like reading the file system or making network requests. Otherwise, it's a choice between the benefits of doing it in-memory vs. via Process. A major benefit of using Process is that you get long-lived caching on disk, rather than only memoization from the Pants daemon.
better-sunset-63499 (02/15/2022, 6:27 PM):
/parent/child/things{.txt,json,csv} and I want to move them to be things.json in the top level, and ignore the txt and csv files.
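That selection/rename (keep only the .json, hoist it to the top level) is plain path manipulation once the engine hands you the entry paths. A runnable sketch with an illustrative helper name:

```python
from pathlib import PurePath

def hoist_json(paths):
    """Keep only .json files, mapping each kept path to its new top-level name."""
    return {p: PurePath(p).name for p in paths if p.endswith(".json")}
```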
enough-analyst-54434 (02/15/2022, 6:27 PM):
Process results are truly cached forever and can be remoted. Everything else runs every time pantsd starts up anew.
hundreds-father-404 (02/15/2022, 6:30 PM):
…core/util_rules/archive.py), then you'll get a Digest. Now, use DigestEntries to get a handle to each file in the exploded zip, without reading the bytes into memory (i.e. not DigestContents). Combine that with CreateDigest to get back what you want: https://www.pantsbuild.org/docs/rules-api-file-system#digestentries
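As pseudocode, that DigestEntries + CreateDigest step could look roughly like the following (this assumes the entries carry a path and a file digest that CreateDigest can reuse; the exact entry type and constructor shape vary by Pants version):

```
entries = await Get(DigestEntries, Digest, extracted_digest)
kept = [
    # Hoist each kept file to the top level, without touching its bytes.
    FileEntry(PurePath(entry.path).name, entry.file_digest)
    for entry in entries
    if entry.path.endswith(".json")
]
new_digest = await Get(Digest, CreateDigest(kept))
```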
better-sunset-63499 (02/15/2022, 6:36 PM):
…Process?
hundreds-father-404 (02/15/2022, 6:37 PM):
…Get(Digest, CreateDigest(...)). Then, you can use DigestContents to get the raw bytes of the single file, do your Python manipulation of it, and finally call Get(Digest, CreateDigest(...)) one last time with the modified file.

Downsides to doing it in-memory are:
1. if the file is very large, materializing it into memory isn't great
2. it won't be cached to disk, only memoized in memory, so when pantsd restarts you'll redo the work
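In pseudocode, the in-memory round trip described above (transform is a stand-in for your plain-Python manipulation, not a real API):

```
contents = await Get(DigestContents, Digest, single_file_digest)
fc = contents[0]                   # a FileContent with .path and .content
new_bytes = transform(fc.content)  # pure Python, no filesystem IO
result_digest = await Get(
    Digest, CreateDigest([FileContent(fc.path, new_bytes)])
)
```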
better-sunset-63499 (02/15/2022, 6:41 PM):
…the Process version, if my code doing the manipulation were in Python, would be like src/python/pants/backend/terraform/dependency_inference.py, where I'd have a standalone script available in pkg_resources and use VenvPex to build that into a command I eventually run under Process, right?
```python
@dataclass(frozen=True)
class ParserSetup:
    pex: VenvPex


@rule
async def setup_parser(hcl2_parser: TerraformHcl2Parser) -> ParserSetup:
    parser_script_content = pkgutil.get_data("pants.backend.terraform", "hcl2_parser.py")
```
fast-nail-55400 (02/15/2022, 6:43 PM):
…narrow the Digest in engine rules to just the files that need to be processed, and then invoke your Python script to do the JSON processing, then recombine the output of that Process into a Digest with whatever files it needs.
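A pseudocode sketch of that flow (the script name and captured output are illustrative; check the Process API for your Pants version):

```
result = await Get(
    ProcessResult,
    Process(
        argv=["python3", "myscript.py", "things.json"],
        input_digest=narrowed_digest,    # only the files the script needs
        output_files=["things.json"],    # what to capture back into a Digest
        description="Rewrite things.json",
    ),
)
final_digest = result.output_digest
```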
fast-nail-55400 (02/15/2022, 6:47 PM):
…Directory protos with references to their digests.
hundreds-father-404 (02/15/2022, 6:52 PM):
…core.util_rules.archive -> DigestEntries + CreateDigest, you will make the Process you use to process the single .json file more efficient, for example less time to set up the sandbox (tmp directory).