# general
b
I'm following "pex in a container" here. But I'm seeing that the stage for the 3rdparty unpacking is not re-used when only changing 1stparty, as suggested. I'm assuming this is because the PEX for the 3rdparty stage isn't re-used, since it also contains the 1stparty code. Am I understanding this correctly? Is there a way to have re-use (like mapping the docker build context's PEX, not copying)?
e
It should be the case that if 3rdparty is not re-used, then unpacking the PEX with zip will reveal different contents in the root `.deps/` dir.
b
I suppose another slice of this is the pants docker plugin
e
So, that's the gut check. Assuming `.deps/` has changed, you have a Pants problem; if not, you have a Pex problem.
Yeah, this could be in the Pants docker support. The zip comparison should help isolate this quickly.
Is your `pex_binary` target depending on a `python_distribution`? That'd do it.
That's the one by-design path I'm aware of for 1st party to show up as 3rdparty.
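The zip comparison suggested above can be scripted. This is a minimal sketch, assuming the PEX is a plain zip file and that third-party code lives under the root `.deps/` directory as described; `deps_fingerprint` is a hypothetical helper name, not part of Pex or Pants:

```python
import hashlib
import zipfile


def deps_fingerprint(pex_path, prefix=".deps/"):
    """Hash the names and contents of every zip entry under `prefix`."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(pex_path) as zf:
        # Sort so the result does not depend on the order entries were written.
        for name in sorted(n for n in zf.namelist() if n.startswith(prefix)):
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            digest.update(zf.read(name))
    return digest.hexdigest()
```

Run it on the PEX built before and after the first-party edit. If the two fingerprints differ, `.deps/` changed, which per the gut check above points at Pants rather than Pex.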
b
No, don't think so. I'm leaning towards Pants on this one as the causer. I'll try looking into it further tomorrow, inspecting the zip is a great idea
I think I might also be failing to understand docker's caching. If the input file (the `.pex`) has changed, how does it know to not try and re-create the image?
Also, my test here is:
• Build the image
• Run `docker images`
• Add a comment to a first-party source
• Re-build the image
• Re-run `docker images`
I see two `<none>` images being built each time in addition to the tagged image
e
Docker caches "layers"; basically the effect of each RUN or COPY against the prior fs layer.
It skips through the layers until it hits a change.
b
Wouldn't the result of the `COPY` then be uncached, because the PEX is changing?
e
No. In the example the COPY uses `as`, which just copies that portion of the input PEX, not the whole thing.
So the deps portion COPY layer should not change if deps don't change.
So you should hit a skip for that layer.
Then the src layer, which is very purposefully ordered to COPY after the deps layer, does produce a new layer.
So you rebuild, but only that 1 src layer at the end.
It's the `as` / `--from` pair that gets you this fanciness. This trick did not always exist and is relatively new (maybe 5 years old, but not always around).
b
How does the image layer with the `COPY` named `deps` not get invalidated with the new contents of `my-app.pex`?
Even if only the firstparty changed, the overall file changed, right? Sorry, I think I'm missing something obvious
e
Well it does, but everything after it short-circuits. So I guess you're saying COPY whole.pex is slow then - aka you have a huge pex?
You should see docker saying you get a cache hit on that deps layer even if slow.
b
Ah, you're saying that `COPY --from=deps /my-app /my-app` is "fast" because that's fast, then?
It still "builds" the images for `deps` and `srcs` only to produce the same `deps` image, so the work involved when doing the final thing is re-used.
And it is ordered as such because `deps` changing is much less frequent than `srcs` changing
e
I'm slightly lost. I'm only saying whatever https://pex.readthedocs.io/en/latest/recipes.html#pex-app-in-a-container says.
Maybe sharpen the pain / reason for questions. Are the steps slower than you'd expect?
b
Yeah I think I see the "idea" and understand it well enough now. Doing the steps above with a clean docker I see 6 images
I would expect 5
e
You should always get the same number of layers, the only difference will be which are cache hits for the purposes of whether the next layer needs to run or not.
So, if you had deps, run1, run2, srcs as your layers, run1 and run2 would truly short circuit here.
Since a RUN layer just takes the hash of the RUN command string.
Unlike a COPY layer that takes the hash of the copied-in items.
Maybe that sharpens it up?
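To make the short-circuit behaviour concrete, here is a toy model of layer caching. It is a sketch of the idea described above (RUN keyed on the command string, COPY keyed on the copied-in bytes, each layer chained to its parent), not Docker's actual implementation; `layer_key` and `build` are hypothetical names:

```python
import hashlib


def layer_key(parent_key, instruction, copied_bytes=b""):
    """Toy cache key: parent layer + instruction text + (for COPY) copied content."""
    h = hashlib.sha256()
    h.update(parent_key.encode("utf-8"))
    h.update(instruction.encode("utf-8"))
    h.update(copied_bytes)
    return h.hexdigest()


def build(steps, cache):
    """Walk the steps in order; return [(instruction, cache_hit)] and update `cache`."""
    parent = "base"
    results = []
    for instruction, content in steps:
        key = layer_key(parent, instruction, content)
        results.append((instruction, key in cache))
        cache.add(key)
        parent = key  # every later layer depends on this one
    return results
```

With the steps ordered deps, run1, run2, srcs, changing only the srcs bytes on a second `build` reports hits for the first three layers and a miss for the last. Putting the srcs COPY first instead makes every later layer miss, since each key includes its parent.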
b
oh yes I see
I need to move my `COPY` instructions higher up
I'm hopping off work, but I think that'll bear fruit
I'm also seeing long build times, but I think that's comparatively normal? My images are ~7GB. I'll have to see why
I also wonder what it'd look like to export 2 PEXes. One with only deps and one with only sources. Then do the multi-stage build using their respective PEXes
e
Pants could do that. Pex will not. That would mean generating invalid artifacts you have to know how to combine. I mean, Pex lets you use it that way of course! In two steps.
b
oh yeah, first thought would be an in-repo pants plugin to test the waters
e
Yeah, in the context of the Docker integration that would be very useful.
Basically an internal speed hack for a well known case or else a documented hack Pants allows you to set up if you know what you're doing.
b
I'd be ok with the latter for sure
And to close the loop on this: unzipping the 2 images shows the deps have the exact same layer hashes.
I think maybe our docker invocation should set some setting so we don't see those images? I'll ping Andreas
Ah epiphany. We might be able to offload this work using Pants itself: use an `experimental_shell_command`'s output as a dep, where the command does the `--compile`
This is promising:
experimental_shell_command(
    name="unpacked_deps",
    command=f"PEX_TOOLS=1 python3.8 {'/'.join(('..' for _ in build_file_dir().parts))}/{'.'.join(build_file_dir().parts)}/binary.pex venv --scope=deps --compile app-deps",
    dependencies=[":binary"],
    outputs=["app-deps/lib"],
    tools=["python3.8"],
)
But honestly I dream of a solution where we don't build a pex just to immediately unpack it
e
Yeah, that's a pretty tame dream. That should be straightforward to add to the docker integration I'd think. Alternatively, if Pants exposed the Pex `--layout {loose,packed,zipapp}` option in `pex_binary` you could always just configure loose and COPY `loose.pex/.deps` for the `deps` layer and the rest (`loose.pex/{.bootstrap,__main__.py,PEX-INFO,<1st party>}`) for the `src` layer.
b
It does expose that! That's a great idea
OK so this does speed up the build, doesn't produce extraneous images, and works like a charm. How does this play with `venv` mode? Ideally the last image layer (I'm learning!) has the code prepped for immediate execution (while not having the files duplicated in the `COPY` destination and the final destination). I guess too, I don't care about what mode it runs in specifically, just that it has near-native startup time (which `venv` mode promises)
I suppose I can tweak my `experimental_shell_command` to use `--scope=all` and then use multiple `COPY` instructions for the relevant slices of the venv
e
The `--layout` is orthogonal to the runtime execution mode; so if the `loose` PEX is `--venv`, when you run the loose PEX `__main__.py` it will bootstrap a venv under PEX_ROOT on 1st run.
Exactly! Do that instead.
Actually, no - you'd want two scopes still I'd think so the logic of what to copy stays in PEX.
b
Only problem is the glob-ability of `COPY` isn't exclude-friendly. Ideally I have `COPY path/to/lib/<not my 1stparty>` then `COPY path/to/lib/<1stparty>`
e
You can be dumber / more robust then.
b
Ah right, two scopes would solve that
e
So - do exactly like the pex docs recommend, but outside the container 1st, then use that to copy in the prebuilt slices of the venv.
b
yup
So then it's just a bummer we still build an "intermediate" PEX in the sandbox, but honestly not doing that is really weird because we are packaging that PEX
These experimental shell commands are also a bit eyebrow-raising. But for now we're just experimenting. Making this cleaner is doable over time
Ah Pants getting in the way 😕
Error expanding output globs: Failed to read link "/tmp/process-executionhCEYrB/mcd/techlabs/projects/aidt/asr_service/app-srcs/bin/python3.8": Absolute symlink: "/usr/bin/python3.8"
I think @witty-crayon-22786 fixed this, but it isn't in 2.12.
We're kind of cheating here I suppose. Making the venv on the dev box, and then just copying it in to the docker image and assuming everything is kosher. Multi-stage unpacks in the dest image of choice. I'm gonna need to stew on this one
e
Yeah. You need to have an interpreter / OS where you run the unpack that is compatible with where you use the unpack, since what is unpacked is influenced by the interpreter / OS being used to run the unpack.
The beauty of the PEX unpack inside the container is this is not a worry.
b
right
e
podman (uses buildah) supports mounting in volumes when building an image, unlike docker: https://github.com/containers/buildah/blob/main/docs/buildah-build.1.md
That would allow no COPY and just RUN against the mounted in PEX - not sure if that works out faster in the end or not.
I had a great experience with podman / buildah / crun in ~2019. Haven't touched it since.
And I think your overlords are big backers of this project IIRC.
Ah no, Guiseppe is RedHat.
b
OK so this test is weird, although only informational as the timing doesn't matter... When I added a newline to a firstparty file, rebuilt the PEX, and re-ran the docker build, the `COPY` from the `deps` stage still ran. Admittedly the `COPY` took 4 seconds, so honestly it doesn't make a difference timing-wise. But it is a datapoint
e
Um, I'm not sure of Docker logic, but it could be imagined that it took the fingerprint of the "context" (+ the fingerprint of the COPY instruction text) as the cache key for COPY instructions. If so, the Docker context includes both the unchanged extracted deps and the changed extracted srcs and so, in sum, it has changed.
Where the Docker "context" - docker's terminology, is the dir tree rooted at the Dockerfile
So, that would totally explain the behavior.
This all makes more sense if you try to imagine how you would implement (caching layers in) a docker build yourself.
b
Secondly I noticed the compilation of `--scope=deps` took ~40 seconds with `layout=zipapp`, but took ~150 seconds with `layout=loose`. `layout=packed` seems to match `layout=zipapp`. I think if I only copy the PEX-specific bits and use `layout=packed` we might be on to something
Yeah this gets the cache turned to 11 (using a packed PEX)
FROM ... as deps
COPY my/binary.pex/__main__.py /bin/app.pex/__main__.py
COPY my/binary.pex/.bootstrap /bin/app.pex/.bootstrap
COPY my/binary.pex/PEX-INFO /bin/app.pex/PEX-INFO
COPY my/binary.pex/.deps /bin/app.pex/.deps
RUN PEX_TOOLS=1 python3.8 /bin/app.pex venv --scope=deps --compile /bin/app
plus
FROM ... as srcs
COPY my/binary.pex /bin/app.pex
RUN PEX_TOOLS=1 python3.8 /bin/app.pex venv --scope=srcs --compile /bin/app
If I edit a firstparty file inside the PEX and re-run, it completely skips the `deps` stage
So now I just need to verify that none of `.deps` or the other `deps` stage inputs change when I edit a 1stparty file with a meaningless change (like a comment)
Shucks... I see the `deps` stage running. Oddly it doesn't run the `COPY` but does run the `RUN` 🤔
e
I'm going to bow out of the live debug session; I think you know the relevant bits at play to dig on this further.
b
"code_hash": "8c7b5100b86874cd4d41b5994e7ca8061ecf5403",
RIP
So because the PEX info file contains the code hash, we'll never get a true cache for deps stage
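One way to confirm that `code_hash` is the only thing moving is to diff the two `PEX-INFO` files by key. A minimal sketch, assuming `PEX-INFO` is a JSON object (as the quoted `"code_hash": ...` line suggests); `differing_keys` is a hypothetical helper, not a Pex API:

```python
import json

_MISSING = object()  # distinguishes "key absent" from "value is null"


def differing_keys(pex_info_a, pex_info_b):
    """Return the sorted top-level keys whose values differ between two PEX-INFO texts."""
    a = json.loads(pex_info_a)
    b = json.loads(pex_info_b)
    return sorted(
        key
        for key in a.keys() | b.keys()
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    )
```

Feed it the `PEX-INFO` text from the PEX built before and after a comment-only first-party edit. A result of `["code_hash"]` would confirm that this one field is what invalidates the `deps` stage inputs.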
e
Then delete PEX-INFO after extracting srcs and deps? I'm a bit lost where you're at, but the venv tool code leaves out PEX-INFO in the venv it creates with `--scope deps` for this reason. It only adds it when you complete things in `--scope srcs`.
b
I'm trying to skip the deps stage altogether when only 1stparty changes. But it can't skip when inputs change, and in that case `PEX-INFO` is an input
e
Ok, pull me in if you need help once you're settled on the hacking around explorations.
b
Does PEX need the code hash? I might try deleting it and seeing what goes boom
e
It doesn't for venv execution mode no matter the layout. It does otherwise in packed and zip layouts to share code compilation in `~/pex/user_code/<hash>/`.
b
AssertionError: Expected code_hash to be populated for Spread PEX directory /bin/app.pex.
For compiling deps 😞
e
Ok. Well, like I said, once you've settled hacks and have a coherent view of what you'd like to do, pull me in if needed. I have a hard time real-time tracking your debugging / coding session but can probably be pretty helpful if you present something a bit more async with time to absorb the goal, the attempted paths, etc.
b
I think I'm all set minus `PEX-INFO` not staying the same after just touching 1stparty
All of this is very involved, in general. I think I'll boil it down to an in-repo plugin, in which case I can muck with `PEX-INFO` myself. So maybe no request, just highlighting a potential improvement