What's the difference between pex generated with different l... Pants #general


rough-room-65027

06/13/2024, 5:56 AM

What's the difference between pexes generated with different layouts: zipapp, packed, and loose? I understand how a pex is structured in each layout, but I would like to know the exact scenarios in which one is better than the others. The docs on the Pants website say that the different layouts offer different "caching and syncing tradeoffs", but that isn't very clear to me.

👀 1

curved-television-6568

06/13/2024, 8:16 AM

Hi, `pex --help` may provide a little more detail than the Pants docs:

--layout {zipapp,packed,loose}

By default, a PEX is created as a single file zipapp when `-o` is specified, but either a packed or loose directory tree based layout can be chosen instead. A packed layout PEX is an executable directory structure designed to have cache-friendly characteristics for syncing incremental updates to PEXed applications over a network. At the top level of the packed directory tree there is an executable `__main__.py` script. The directory can also be executed by passing its path to a Python executable; e.g: `python packed-pex-dir/`. The Pex bootstrap code and all dependency code are packed into individual zip files for efficient caching and syncing. A loose layout PEX is similar to a packed PEX, except that neither the Pex bootstrap code nor the dependency code are packed into zip files, but are instead present as collections of loose files in the directory tree providing different caching and syncing tradeoffs. Both zipapp and packed layouts install themselves in the PEX_ROOT as loose apps by default before executing, but these layouts compose with `--venv` execution mode as well and support `--seed`ing. (default: zipapp)
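To make the three layouts concrete, here is a minimal Python sketch that guesses a PEX's layout from its shape on disk. It is a heuristic built only from the help text above, not a Pex API: it assumes a zipapp is a single zip file, a packed PEX keeps each dependency as a zip file under `.deps/`, and a loose PEX keeps dependencies as plain directories. `guess_layout` and `demo` are hypothetical names, and `demo` builds tiny fake trees rather than real PEXes.

```python
import tempfile
import zipfile
from pathlib import Path


def guess_layout(pex_path: Path) -> str:
    """Guess the layout of a built PEX from its on-disk shape (heuristic only)."""
    if pex_path.is_file():
        # A zipapp is one file that is itself a zip archive.
        return "zipapp" if zipfile.is_zipfile(pex_path) else "unknown"
    if not (pex_path / "__main__.py").is_file():
        return "unknown"
    deps = pex_path / ".deps"  # assumed location of dependency code
    entries = list(deps.iterdir()) if deps.is_dir() else []
    if entries and all(e.is_file() and zipfile.is_zipfile(e) for e in entries):
        return "packed"  # each dependency is its own zip file
    # Dependencies are plain directories (or absent, which this cannot tell apart).
    return "loose"


def demo() -> dict:
    """Build fake zipapp/packed/loose trees and classify each one."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        zipapp = root / "app.pex"
        with zipfile.ZipFile(zipapp, "w") as zf:
            zf.writestr("__main__.py", "print('hi')\n")
        results["zipapp"] = guess_layout(zipapp)

        packed = root / "packed.pex"
        (packed / ".deps").mkdir(parents=True)
        (packed / "__main__.py").write_text("")
        with zipfile.ZipFile(packed / ".deps" / "dep.whl", "w") as zf:
            zf.writestr("dep/__init__.py", "")
        results["packed"] = guess_layout(packed)

        loose = root / "loose.pex"
        (loose / ".deps" / "dep.whl" / "dep").mkdir(parents=True)
        (loose / "__main__.py").write_text("")
        (loose / ".deps" / "dep.whl" / "dep" / "__init__.py").write_text("")
        results["loose"] = guess_layout(loose)
    return results
```

Running `tree` on a real packed and a real loose PEX side by side is still the more reliable way to see the difference; the sketch only encodes the description quoted above.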

✅ 1

happy-kitchen-89482

06/13/2024, 1:48 PM

FWIW you often want to use `--layout packed` and `--venv` in practice, especially if deploying in a docker image. Although `zipapp` is useful if you want a single `.pex` file for ease of deployment of the raw file.

✅ 1

broad-processor-92400

06/14/2024, 7:37 AM

Spelling out the consequences on Pants' caching:
• Pants caches process outputs, but it stores each individual file within the output separately (so a process that outputs a directory with many files will have each file stored individually, rather than as a single large blob containing the directory and all its files)
• The `zipapp` layout is a single large zip file: any change to its contents (even adding a single character to a comment in a single source file) will result in a pex with (slightly) different contents, and the whole pex will have to be created from scratch, zipping up all its contents.
• As those docs describe, the `packed` and `loose` layouts use more directories, so sub-parts of the pex are stored separately
• This can be particularly noticeable when a pex uses large dependencies (e.g. numpy, opencv, pytorch, tensorflow):
  ◦ for `zipapp`: even a tiny change to an input source file will store a new many-megabyte pex in the cache, plus pex will spend more time manipulating the dependencies to put them into that zip
  ◦ for `packed`: a tiny change to a source file won't change the dependencies, and thus the `.whl`s in `.deps/` won't need to be recached (i.e. those files will be deduplicated within the cache), plus pex can just copy them around rather than needing to synthesize a whole new zip
  ◦ `loose` is similar to `packed`, although I think it can be slower, since it has to unzip all the dependencies and write their contents to disk
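The caching argument above can be illustrated with a toy content-addressed cache. This is a rough Python sketch, not how Pants or Pex are implemented: the helper names, the single ~1 MB stand-in dependency, and the simplified output trees are all assumptions made for illustration. It builds a "zipapp-like" output (one zip of everything) and a "packed-like" output (sources as separate files, one zip per dependency), then counts how many bytes are new to the cache after a one-character edit to a comment.

```python
import hashlib
import io
import zipfile


def zip_bytes(files: dict) -> bytes:
    """Zip files deterministically (sorted names, fixed timestamps, no compression)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            zf.writestr(info, files[name])
    return buf.getvalue()


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def zipapp_outputs(sources: dict, deps: dict) -> dict:
    """One output file: sources and all dependency files zipped together."""
    everything = dict(sources)
    for dep_files in deps.values():
        everything.update(dep_files)
    return {"app.pex": zip_bytes(everything)}


def packed_outputs(sources: dict, deps: dict) -> dict:
    """Many output files: sources kept as-is, one zip per dependency (simplified)."""
    out = dict(sources)
    for dep_name, dep_files in deps.items():
        out[f".deps/{dep_name}.whl"] = zip_bytes(dep_files)
    return out


def new_cache_bytes(before: dict, after: dict) -> int:
    """Bytes in `after` whose content digest was not already stored for `before`."""
    known = {digest(blob) for blob in before.values()}
    return sum(len(blob) for blob in after.values() if digest(blob) not in known)


def demo() -> dict:
    deps = {"bigdep": {"bigdep/data.bin": bytes(range(256)) * 4000}}  # ~1 MB stand-in
    src_v1 = {"main.py": b"print('hi')  # comment\n"}
    src_v2 = {"main.py": b"print('hi')  # comment.\n"}  # one character added
    results = {}
    for name, build in (("zipapp", zipapp_outputs), ("packed", packed_outputs)):
        results[name] = new_cache_bytes(build(src_v1, deps), build(src_v2, deps))
    return results
```

In this toy model the zipapp-like build re-stores the whole archive, while the packed-like build only stores the edited source file because the dependency zip hashes to the same digest as before.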

🎉 1

rough-room-65027

06/14/2024, 8:41 AM

@broad-processor-92400 thanks for the detailed explanation... it's really helpful. Hope this can be added to the pex_binary docs for the `layout` field 🙏

💯 1

wide-midnight-78598

06/15/2024, 12:28 AM

I think my confusion has ended up being around `loose` - because in the docs, and in practice, it's a bit harder to grok. I recall needing to run `tree` on the unzipped pex and diff the output to try to get a grasp of what was going on. Benjy's comment is basically where I landed too 😆

wide-midnight-78598

06/15/2024, 12:30 AM

This is literally what I've been running locally to figure out what's going on half the time with my deps


pex_binary(
    name="bin",
    dependencies=[":lib"],
    entry_point="main.py",
    execution_mode=parametrize("venv", "zipapp"),
    layout=parametrize("loose", "packed", "zipapp"),
)
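For anyone unsure what that `parametrize` pair expands to, here is a small Python sketch that enumerates the combinations. It only mimics the address format that shows up in the `pants package` commands later in this thread (`simple:bin@execution_mode=venv,layout=loose`); `parametrized_addresses` is a hypothetical helper, not Pants code, and the `simple:bin` base address is taken from those commands.

```python
from itertools import product


def parametrized_addresses(base: str, **fields) -> list:
    """List one address per combination of the given field values."""
    names = sorted(fields)  # field names in alphabetical order, as in the thread's addresses
    combos = product(*(fields[name] for name in names))
    return [
        base + "@" + ",".join(f"{name}={value}" for name, value in zip(names, combo))
        for combo in combos
    ]


addresses = parametrized_addresses(
    "simple:bin",
    execution_mode=["venv", "zipapp"],
    layout=["loose", "packed", "zipapp"],
)

if __name__ == "__main__":
    for address in addresses:
        print(address)
```

Two execution modes times three layouts gives six targets, each of which can be packaged and inspected separately.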

wide-midnight-78598

06/15/2024, 12:57 AM

Actually, an interesting (weird) note: `loose` doesn't seem to be cached as I would expect. With no modifications, `loose` still takes about 1 second to package (when the other variants take about 0.1s)


scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=loose
20:55:12.90 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=loose.pex
pants package simple:bin@execution_mode=venv,layout=loose  0.01s user 0.01s system 1% cpu 1.043 total

scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=zipapp
20:55:14.35 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=zipapp.pex
pants package simple:bin@execution_mode=venv,layout=zipapp  0.01s user 0.01s system 1% cpu 0.103 total

scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=packed
20:55:18.33 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=packed.pex
pants package simple:bin@execution_mode=venv,layout=packed  0.01s user 0.01s system 19% cpu 0.109 total

broad-processor-92400

06/15/2024, 1:01 AM

For a large pex with a lot of small files in it, it might take that long to delete the old directory within `dist/` and write the new one

wide-midnight-78598

06/15/2024, 1:05 AM

Ah yeah, that's a safe bet - like 7k files, whereas packed is like 50

wide-midnight-78598

06/15/2024, 1:06 AM

I was thinking that was lazily unpacked with a zipapp execution, I guess not 🤷 - edit: as in, I didn't realize the unpack time was paid during `package`

broad-processor-92400

06/15/2024, 1:22 AM

I don't think it's unpacking (at least, not unzipping it) exactly:
• The pex invocation will have done the unpacking internally, with its output digest being a directory full of 7k files.
• When `pants package` is finalising its work by writing the output digest(s) into `dist`, it has to:
  ◦ first, delete anything that's already there (i.e. potentially 7k files if overwriting an existing package).
  ◦ then, write each file in the digest to disk
• You'd potentially see similar behaviour with any set of 7k files (e.g. a shell command that generates many files + `pants export-codegen`), even if the output wasn't conceptually connected to a zip
• NB. I don't know the specifics so I could be wrong, but you could get a sense of how much the file manipulation costs with commands like `time rm -rf dist/...` and `time cp -R dist/... /tmp/whatever`
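The same measurement can be sketched in Python without touching `dist/` at all. This is only a rough stand-in for what `pants package` does when materializing outputs: it writes a fixed amount of data as N equal-sized files into a temporary directory, then deletes them, timing both steps. The function name and the 7 MB total are arbitrary assumptions; the 50 and 7000 file counts echo the numbers mentioned above for packed and loose.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_write_and_delete(file_count: int, total_bytes: int) -> tuple:
    """Return (write_seconds, delete_seconds) for file_count equal-sized files."""
    payload = b"x" * (total_bytes // file_count)
    root = Path(tempfile.mkdtemp())
    out = root / "out"
    try:
        start = time.perf_counter()
        out.mkdir()
        for i in range(file_count):
            (out / f"f{i}").write_bytes(payload)
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        shutil.rmtree(out)
        delete_s = time.perf_counter() - start
    finally:
        # Clean up the temporary root even if something above failed.
        shutil.rmtree(root, ignore_errors=True)
    return write_s, delete_s


if __name__ == "__main__":
    for count in (50, 7000):
        w, d = time_write_and_delete(count, total_bytes=7_000_000)
        print(f"{count:>5} files: write {w:.3f}s, delete {d:.3f}s")
```

Absolute numbers will vary a lot by filesystem and machine, so the comparison between the two file counts is the interesting part, not the totals.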

wide-midnight-78598

06/15/2024, 1:24 AM

Yeah, I did that - about 300ms to remove the directory (and about 20ms to remove all the other directories from my experiments). So yeah, it must come down to removing and re-materializing all that data

👍 1

wide-midnight-78598

06/15/2024, 1:24 AM

So, it's a good hypothesis - that all the time spent is just disk I/O basically

wide-midnight-78598

06/15/2024, 1:25 AM

I use `scie`s a lot, and I set mine up to unpack on first run - so I'm used to deferring that cost from my `pants` time. Gonna have to remember this... Thanks!
