What's the difference between pex generated with d...
# general
r
What's the difference between PEXes generated with the different layouts: zipapp, packed, and loose? I understand how a PEX is structured in each layout, but I would like to know the exact scenarios in which one is better than another. The docs on the Pants website suggest that the different layouts offer different "caching and syncing tradeoffs", but it's not very clear to me.
👀 1
c
Hi, `pex --help` may provide a little more detail than the Pants docs:
--layout {zipapp,packed,loose}
    By default, a PEX is created as a single file zipapp when `-o` is specified, but either a packed or loose directory tree based layout can be chosen instead. A packed layout PEX is an executable directory structure designed to have cache-friendly characteristics for syncing incremental updates to PEXed applications over a network. At the top level of the packed directory tree there is an executable `__main__.py` script. The directory can also be executed by passing its path to a Python executable; e.g: `python packed-pex-dir/`. The Pex bootstrap code and all dependency code are packed into individual zip files for efficient caching and syncing. A loose layout PEX is similar to a packed PEX, except that neither the Pex bootstrap code nor the dependency code are packed into zip files, but are instead present as collections of loose files in the directory tree providing different caching and syncing tradeoffs. Both zipapp and packed layouts install themselves in the PEX_ROOT as loose apps by default before executing, but these layouts compose with `--venv` execution mode as well and support `--seed`ing. (default: zipapp)
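The help text says a packed PEX directory can be run by passing its path to a Python executable, because the directory has a `__main__.py` at its top level. A minimal sketch of that general Python mechanism, using a stand-in directory rather than a real PEX (the directory contents and the printed message are made up for illustration):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a packed PEX: just a directory with a top-level __main__.py.
    # A real packed PEX also contains the Pex bootstrap and dependency zips.
    app_dir = Path(tmp) / "packed-pex-dir"
    app_dir.mkdir()
    (app_dir / "__main__.py").write_text('print("hello from __main__.py")\n')

    # Equivalent to running `python packed-pex-dir/` in a shell.
    result = subprocess.run(
        [sys.executable, str(app_dir)],
        capture_output=True,
        text=True,
        check=True,
    )

output = result.stdout.strip()
print(output)
```

This only shows why `python packed-pex-dir/` works at all; what the real `__main__.py` does after that is up to Pex.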
✅ 1
h
FWIW you often want to use `--layout packed` and `--venv` in practice, especially if deploying in a Docker image. That said, `zipapp` is useful if you want a single `.pex` file, since a single raw file is easy to deploy.
✅ 1
b
Spelling out the consequences for Pants' caching:
• Pants caches process outputs, but it stores each individual file within the output separately (so a process that outputs a directory with many files will have each file stored individually, rather than as a single large blob containing the directory and all its files).
• The `zipapp` layout is a single large zip file: any change to its contents (even adding a single `.` to a comment in a single source file) will result in a pex with (slightly) different contents, and the whole pex will have to be created from scratch, zipping up all its contents.
• As those docs describe, the `packed`/`loose` layouts use more directories, so sub-parts of the pex are stored separately.
• This can be particularly noticeable when a pex uses large dependencies (e.g. numpy, opencv, pytorch, tensorflow):
    ◦ for `zipapp`: even a tiny change to an input source file will store a new many-megabyte pex in the cache, plus pex will spend more time manipulating the dependencies to put them into that zip
    ◦ for `packed`: a tiny change to a source file won't change the dependencies, and thus the `.whl`s in `.deps/` won't need to be recached (i.e. those files will be deduplicated within the cache), plus pex can just copy them around rather than needing to synthesize a whole new zip
    ◦ `loose` is similar to `packed`, although I think it can be slower, since it has to unzip all the dependencies and write their contents to disk
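The caching argument above can be sketched with a toy content-addressed cache that stores each output file under its SHA-256 digest. This is only an illustration of the reasoning, not Pants' actual store or Pex's real on-disk format: the helper names (`zipapp_layout`, `packed_layout`, `new_bytes_to_store`), the file names, and the 1 MB fake dependency are all assumptions made up for the example.

```python
import hashlib
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build a deterministic, uncompressed zip of the given files in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name in sorted(files):
            # ZipInfo(name) uses a fixed default timestamp, so output is reproducible.
            zf.writestr(zipfile.ZipInfo(name), files[name])
    return buf.getvalue()


def zipapp_layout(sources: dict[str, bytes], deps: dict[str, bytes]) -> dict[str, bytes]:
    """Toy zipapp: one output file containing sources and dependencies together."""
    return {"app.pex": make_zip({**sources, **deps})}


def packed_layout(sources: dict[str, bytes], deps: dict[str, bytes]) -> dict[str, bytes]:
    """Toy packed layout: sources and each dependency are separate output files."""
    out = {f"app.pex/{name}": data for name, data in sources.items()}
    out.update({f"app.pex/.deps/{name}": data for name, data in deps.items()})
    return out


def new_bytes_to_store(cache: set[str], output: dict[str, bytes]) -> int:
    """Add each output file to the cache by digest; return bytes not already cached."""
    total = 0
    for data in output.values():
        key = hashlib.sha256(data).hexdigest()
        if key not in cache:
            cache.add(key)
            total += len(data)
    return total


deps = {"bigdep.whl": b"\0" * 1_000_000}  # stand-in for a large dependency
v1 = {"main.py": b"print('hi')  # a comment\n"}
v2 = {"main.py": b"print('hi')  # a comment.\n"}  # one '.' added to the comment

results = {}
for layout in (zipapp_layout, packed_layout):
    cache: set[str] = set()
    new_bytes_to_store(cache, layout(v1, deps))  # first build fills the cache
    results[layout.__name__] = new_bytes_to_store(cache, layout(v2, deps))

for name, extra in results.items():
    print(f"{name}: {extra} new bytes stored after a one-character edit")
```

In this toy model the zipapp variant stores the whole archive again after the edit, while the packed variant stores only the changed source file, because the dependency blob hashes to a digest the cache already holds.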
🎉 1
r
@broad-processor-92400 thanks for the detailed explanation... it's really helpful. Hope this can be added to the pex_binary docs for the `layout` field 🙏
💯 1
w
I think my confusion has ended up being around `loose` - because in the docs, and in practice, it's a bit harder to grok. I recall needing to run a `tree` on the unzipped pex and diffing to try to get a grasp of what was going on. Benjy's comment is basically where I landed too 😆
This is literally what I've been running locally to figure out what's going on half the time with my deps
pex_binary(
    name="bin",
    dependencies=[":lib"],
    entry_point="main.py",
    execution_mode=parametrize("venv", "zipapp"),
    layout=parametrize("loose", "packed", "zipapp"),
)
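As a rough sketch of what that BUILD target expands to: the two `parametrize` fields combine into one target per pair of values, and each can be packaged by address. The address format below is inferred from the `pants package` commands in this thread, and the `simple` directory name comes from those same commands; the `addresses` helper is hypothetical and is not part of Pants.

```python
from itertools import product

# Field values copied from the BUILD snippet above.
parametrized = {
    "execution_mode": ["venv", "zipapp"],
    "layout": ["loose", "packed", "zipapp"],
}


def addresses(target: str, fields: dict[str, list[str]]) -> list[str]:
    """List one address per combination of parametrized field values."""
    names = sorted(fields)
    return [
        target + "@" + ",".join(f"{n}={v}" for n, v in zip(names, combo))
        for combo in product(*(fields[n] for n in names))
    ]


variants = addresses("simple:bin", parametrized)
for address in variants:
    print(f"pants package {address}")
```

Two execution modes times three layouts gives six variants to compare side by side.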
Actually, an interesting (and weird) note: `loose` doesn't seem to be cached as I would expect. With no modifications, `loose` still takes about 1 second to package (whereas the other variants take about 0.1s):
scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=loose
20:55:12.90 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=loose.pex
pants package simple:bin@execution_mode=venv,layout=loose  0.01s user 0.01s system 1% cpu 1.043 total

scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=zipapp
20:55:14.35 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=zipapp.pex
pants package simple:bin@execution_mode=venv,layout=zipapp  0.01s user 0.01s system 1% cpu 0.103 total

scratch/pants-large % time pants package simple:bin@execution_mode=venv,layout=packed
20:55:18.33 [INFO] Wrote dist/simple/bin@execution_mode=venv,layout=packed.pex
pants package simple:bin@execution_mode=venv,layout=packed  0.01s user 0.01s system 19% cpu 0.109 total
b
For a large pex with a lot of small files in it, it might take that long to delete the old directory within `dist/` and write the new one.
w
Ah yeah, that's a safe bet - like 7k files, whereas packed is like 50
I was thinking that was lazily unpacked with a zipapp execution, I guess not 🤷 - edit: as in, I didn't realize the unpack cost was paid at `package` time
b
I don't think it's unpacking (at least, not unzipping it) exactly:
• The pex invocation will have done the unpacking internally, with its output digest being a directory full of 7k files.
• When `pants package` is finalising its work by writing the output digest(s) into `dist`, it has to:
    ◦ first, delete anything that's already there (i.e. potentially 7k files if overwriting an existing package).
    ◦ then, write each file in the digest to disk
• You'd potentially see similar behaviour with any set of 7k files (e.g. a shell command that generates many files + `pants export-codegen`), even if the output wasn't conceptually connected to a zip
• NB. I don't know the specifics so I could be wrong, but you could get a sense of how much the file manipulation costs with commands like `time rm -rf dist/...` or `time cp -R dist/... /tmp/whatever`.
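The delete-then-write hypothesis can also be checked in isolation with a small timing sketch. It only imitates the two steps described above in a temporary directory and does not call Pants; the file counts (7k versus 50) come from this thread, while the 7 MB total size, the directory names, and the helper names are assumptions for illustration.

```python
import shutil
import tempfile
import time
from pathlib import Path


def write_tree(root: Path, n_files: int, total_bytes: int) -> None:
    """Write n_files equally sized files totalling roughly total_bytes."""
    root.mkdir(parents=True)
    payload = b"x" * (total_bytes // n_files)
    for i in range(n_files):
        # Spread files over subdirectories, roughly like an unpacked dependency tree.
        sub = root / f"pkg{i % 50}"
        sub.mkdir(exist_ok=True)
        (sub / f"f{i}.py").write_bytes(payload)


def time_replace(root: Path, n_files: int, total_bytes: int) -> float:
    """Time deleting an existing tree and writing it again from scratch."""
    write_tree(root, n_files, total_bytes)  # the "previous" package already in place
    start = time.perf_counter()
    shutil.rmtree(root)
    write_tree(root, n_files, total_bytes)
    return time.perf_counter() - start


def count_files(root: Path) -> int:
    return sum(1 for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    loose_dir = Path(tmp) / "loose"
    packed_dir = Path(tmp) / "packed"
    many_small = time_replace(loose_dir, 7000, 7_000_000)
    few_large = time_replace(packed_dir, 50, 7_000_000)
    loose_count = count_files(loose_dir)
    packed_count = count_files(packed_dir)

print(f"{loose_count} small files: {many_small:.3f}s to delete and rewrite")
print(f"{packed_count} large files: {few_large:.3f}s to delete and rewrite")
```

Absolute timings depend on the disk and filesystem, so the numbers are only meaningful relative to each other on the same machine.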
w
Yeah, I did that - about 300ms to remove the directory (and about 20ms to remove all the other directories from my experiments). So yeah, it must come down to removing and re-materializing all that data
👍 1
So, it's a good hypothesis - that all the time spent is just disk I/O basically
I use `scie`s a lot, and I set mine up to unpack on first run - so I'm used to deferring that cost from my `pants` time. Gonna have to remember this... Thanks!