# general
p
I'm just playing around with the docker backend to build a Python project. The resulting Docker image seems way too big (e.g. there are a lot of tools contained that shouldn't be there). Is there any way to influence that?
c
Hi! The only thing in the image, in addition to what you get `FROM` your base image, is what you add with `COPY` or `ADD`, or as a side effect of any `RUN` instructions, so there's no magic going on here.
How much extra size do you get for your image, compared to the base image? And how far off is that from any `pex` files etc. that you may be including?
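To make that point concrete, here is a minimal sketch (the image and file names are made up) of where every byte beyond the base image comes from:
Copy code
# Everything beyond the base image comes from COPY/ADD or RUN side effects.
FROM python:3.9-slim                 # base image size
COPY app.pex /app/app.pex            # adds exactly the size of app.pex
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*   # a RUN layer keeps whatever the command leaves behind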
p
Thanks for clarifying. Seems to have been user error... still trying to get the hang of pex and stuff.
👍 1
f
The `.pex` files you've included contain both the source code of your project and the 3rd-party dependencies the project has, as wheels. If there are a lot of dependencies (and they have compiled code, such as `numpy`, `pandas`, `scipy`, etc.) then the `.pex` files may become large. For anecdotal stats, the largest one I see locally is 200 MB. It is possible to create multi-platform PEX files so that the very same `.pex` file can be distributed and run on both Linux and macOS devices. This is achieved by adding multiple platform tags to the `pex_binary` target declaration. This will make sure that the `.pex` file contains wheels for both platforms (e.g., there will be one `.whl` file for `numpy` compiled for macOS and one `.whl` file for `numpy` compiled for Linux). If there are many requirements with compiled non-Python code, a multi-platform PEX file can be significantly larger than a platform-specific one, so make sure to include only those platforms that need to be supported.
💯 2
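A minimal sketch of such a multi-platform target, assuming the `pex_binary` `platforms` field; the platform tag values below are illustrative, so check the Pants docs for the exact tags your interpreters need:
Copy code
pex_binary(
    name="app",
    entry_point="main.py",
    # One Pex platform tag per OS/interpreter pair; values are illustrative.
    platforms=[
        "linux_x86_64-cp-39-cp39",
        "macosx_10.15_x86_64-cp-39-cp39",
    ],
)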
p
Is there a way to build separate Docker layers for, say, 3rd-party dependencies vs. code that comes from within the monorepo, so that I can make use of layer caching?
c
There’s no good built-in support for layering in that way. However, it should be doable if you’re willing to add a bit of complexity: build a PEX with all your 3rd-party deps, and then another “slim” PEX with only your first-party code. Then you can add those two PEXes as separate `COPY` instructions in your Dockerfile, leveraging the Docker caching.
👍 1
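A hedged sketch of what that Dockerfile could look like; `deps.pex` and `app.pex` are hypothetical names, and `PEX_PATH` is the Pex mechanism for merging additional PEXes onto the runtime path:
Copy code
FROM python:3.9-slim
# Slow-changing 3rd-party deps first: this layer is cached until deps change.
COPY deps.pex /app/deps.pex
# Fast-changing first-party code second: only this layer is rebuilt on code edits.
COPY app.pex /app/app.pex
# Merge the deps PEX onto the slim PEX's path at runtime.
ENV PEX_PATH=/app/deps.pex
ENTRYPOINT ["/app/app.pex"]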
There’s also this to consider when running PEXes in a container, if you want to optimise for size/startup latency: https://pex.readthedocs.io/en/latest/recipes.html#pex-app-in-a-container
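Roughly what that recipe boils down to, as a sketch (paths and the Python version are assumptions; see the linked docs for the authoritative version): the PEX is installed into a plain venv at image build time, so containers skip PEX bootstrapping at startup.
Copy code
FROM python:3.9-slim
COPY app.pex /app.pex
# Install the PEX into a venv at build time; --compile pre-compiles .pyc files.
RUN PEX_TOOLS=1 python3.9 /app.pex venv --compile /app
# The venv tool drops a `pex` entry script into the venv.
ENTRYPOINT ["/app/pex"]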
p
I'm particularly wondering about situations like this:
• a Python application depending on pytorch with GPU support
• pytorch with GPU support is huge
• the Python application is comparatively small
• having all dependencies and the application in a single pex file and `COPY`ing that into one Docker layer
• thus a simple change in the application code creates a new layer with all the dependencies
• the resulting big, new layer has to be pushed to the registry
• the resulting big, new layer has to be pulled from the registry
Am I misunderstanding things here?
h
Yeah that's right. This is where Andreas's recommendation right above about a "slim pex" is helpful. Specifically, see this option: https://www.pantsbuild.org/docs/reference-pex_binary#codeinclude_requirementscode
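In sketch form, the two targets might look like the following; only `include_requirements=False` is confirmed by the docs linked above, and the deps-only target address is hypothetical:
Copy code
# Slim PEX: first-party sources only; 3rd-party requirements are stripped.
pex_binary(
    name="app-srcs",
    entry_point="main.py",
    include_requirements=False,
)

# Deps-only PEX: no entry point; depends only on the (hypothetical)
# 3rd-party requirements target.
pex_binary(
    name="app-deps",
    dependencies=["//3rdparty/python:requirements"],
)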
p
Is there an example somewhere of how to achieve that? How would I define a build target for a PEX that contains just the 3rd-party requirements?
i
I’m also interested in how I would be able to build a slim pex
e
There are ways to hack this up today using `pex_binary`'s `layout` field, but they're pretty hacky. If this is a wide enough need, and it sounds like it is!, it's probably better to get an issue started to flesh out an approach.
👍 1
I personally would be terrified of separating sources from deps, then getting the wrong pair mixed together and finding out at container runtime via a missing dep. So, if I were the person requesting this feature, I'd want a guarantee that the source "belongs" with the deps in the deps layer, and a failure at image build time if not, for example.
h
c
Ack, no. Was just an idea 💡
👍 1
h
Ah, sounds good. I too haven't used the workflow, and realize looking at https://github.com/pantsbuild/pants/pull/13894 and https://github.com/pantsbuild/pants/pull/14469 (from @better-sunset-63499) that both cases seem to be loading third-party requirements from the external environment, rather than from Pants building a PEX with only requirements and another PEX with only source code.
e
I'm fairly out of date on Docker details. @curved-television-6568 or anyone else: is Docker layer caching / layer storage content-hash based? I.e., if I build up a layer twice using different steps but the end result has the same contents, is the Docker stack smart enough to get a cache hit and not push/pull those layers around again?
Basically, I'm wondering whether, if PEX grew new options, you could still build a "fat" PEX but install it in 2 steps:
Copy code
RUN PEX_TOOLS=1 ./my.pex venv --deps-only right/here
RUN PEX_TOOLS=1 ./my.pex venv --source-only right/here
If the layers are content-hash based and the 1st step had the same deps as a layer already built, you'd win and be assured your source matched your deps.
f
@polite-secretary-23285 @icy-hair-30586 Not sure if that's what you need, but I was creating PEX files for projects that are normally published as wheels/sdists but have entry points available as console scripts. E.g.
Copy code
pex_binary(
    name='tabulate-runner',
    description='Run tabulate on the input files.',
    entry_point='tabulate',
    dependencies=["//:tabulate"],
    shebang='/usr/bin/python3',
)
After packaging this (having `tabulate` in the root `requirements.txt`), I can now run:
Copy code
$ cat data.txt     
col1,col2,col3
val1,val2,val3

$ ./tabulate-runner.pex --header --sep "," data.txt 
col1    col2    col3
------  ------  ------
val1    val2    val3
In other words, you can create PEX files using Pants (this can also be achieved using the plain `pex` utility, of course) that contain only a single dependency that you can run. Is this any help?
c
I'm not entirely sure how the layers are hashed. They do base it on content, but I'm afraid they might also include information about previous layers... or the full merged result of all layers so far, if you will. I'd have to run some tests to be sure, though.
e
Ah, yeah. That's what I remember from ~4 years ago. That seems unfortunate, but I may very well be missing some implication of having this be purely content-hash based.
It's probably security focused; it works like git commits / blockchains, including the lineage as part of the fingerprint.
👍 1
Beautiful, the world has changed since I last looked: https://gist.github.com/aaronlehmann/b42a2eaf633fc949f93b#id-definitions-and-calculations Sure enough:
Copy code
^jsirois@gill /tmp/docker-test $ cat file1.txt
Same content.
Different content.
^jsirois@gill /tmp/docker-test $ cat Dockerfile1
FROM busybox AS prep
COPY file1.txt /
RUN head -1 file1.txt > /content

FROM alpine
COPY --from=prep /content /

CMD ["cat", "/content"]
^jsirois@gill /tmp/docker-test $ cat file2.txt
Same content.
DIFFERENT CONTENT
^jsirois@gill /tmp/docker-test $ cat Dockerfile2
FROM busybox AS prep
COPY file2.txt /
RUN head -1 file2.txt > /content

FROM alpine
COPY --from=prep /content /

CMD ["cat", "/content"]
^jsirois@gill ~/Downloads/docker-test $ docker build -t image1 -f Dockerfile1 .
Sending build context to Docker daemon  6.656kB
Step 1/6 : FROM busybox AS prep
 ---> 7138284460ff
Step 2/6 : COPY file1.txt /
 ---> 8d439976acc6
Step 3/6 : RUN head -1 file1.txt > /content
 ---> Running in d261924d69f7
Removing intermediate container d261924d69f7
 ---> 08661326e660
Step 4/6 : FROM alpine
 ---> 0a97eee8041e
Step 5/6 : COPY --from=prep /content /
 ---> 9608c4649999
Step 6/6 : CMD ["cat", "/content"]
 ---> Running in 79e88cb4d61f
Removing intermediate container 79e88cb4d61f
 ---> af753bc108ea
Successfully built af753bc108ea
Successfully tagged image1:latest
^jsirois@gill ~/Downloads/docker-test $ docker build -t image2 -f Dockerfile2 .
Sending build context to Docker daemon  6.656kB
Step 1/6 : FROM busybox AS prep
 ---> 7138284460ff
Step 2/6 : COPY file2.txt /
 ---> 7da9dee5433d
Step 3/6 : RUN head -1 file2.txt > /content
 ---> Running in 661242608a2e
Removing intermediate container 661242608a2e
 ---> 6e858bb951cd
Step 4/6 : FROM alpine
 ---> 0a97eee8041e
Step 5/6 : COPY --from=prep /content /
 ---> Using cache
 ---> 9608c4649999
Step 6/6 : CMD ["cat", "/content"]
 ---> Using cache
 ---> af753bc108ea
Successfully built af753bc108ea
Successfully tagged image2:latest
^jsirois@gill ~/Downloads/docker-test $ docker run --rm image1
Same content.
^jsirois@gill ~/Downloads/docker-test $ docker run --rm image2
Same content.
^jsirois@gill ~/Downloads/docker-test $ docker image inspect image1 | jq '.[] | {"Id", "RepoTags", "RootFS"}'
{
  "Id": "sha256:af753bc108ea9b5827fde0dde0472c822d05d4119803b2a7e2205614d00d9a53",
  "RepoTags": [
    "image1:latest",
    "image2:latest"
  ],
  "RootFS": {
    "Type": "layers",
    "Layers": [
      "sha256:1a058d5342cc722ad5439cacae4b2b4eedde51d8fe8800fcf28444302355c16d",
      "sha256:3f093a6664ad9960eeaae1facc5f32f087f066421c8e251b045170d022f417ef"
    ]
  }
}
^jsirois@gill ~/Downloads/docker-test $ docker image inspect image2 | jq '.[] | {"Id", "RepoTags", "RootFS"}'
{
  "Id": "sha256:af753bc108ea9b5827fde0dde0472c822d05d4119803b2a7e2205614d00d9a53",
  "RepoTags": [
    "image1:latest",
    "image2:latest"
  ],
  "RootFS": {
    "Type": "layers",
    "Layers": [
      "sha256:1a058d5342cc722ad5439cacae4b2b4eedde51d8fe8800fcf28444302355c16d",
      "sha256:3f093a6664ad9960eeaae1facc5f32f087f066421c8e251b045170d022f417ef"
    ]
  }
}
So my proposal would work.
🙌 3
i
@enough-analyst-54434 I’ve used a similar approach in the Java world (one copy statement with just the dependency JARs, another copy statement with just the app JAR), and it worked quite well (dependencies usually change more slowly than app code, so every new image costs you just the size of your app JAR). Jib (https://github.com/GoogleContainerTools/jib) does something similar behind the scenes.
e
Aha, yeah. Great. I just went through the exercise of proving this will work and filed a Pex ticket for the new feature here: https://github.com/pantsbuild/pex/issues/1631 Hopefully I've gotten things right.
👍 1
🚀 1
i
Is the entire “wrap into PEX” part really needed for a container? I guess what I really want is something similar to `RUN pip install dependencies` and then `RUN pip install mypackage` in the Dockerfile.
As you can guess, I’m pretty new to trying to package a Python app/lib into a container; sorry if I’m missing something really basic here.
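For reference, the classic pip version of that layering pattern looks something like this (assuming a standard `requirements.txt` plus an installable project at the build context root):
Copy code
FROM python:3.9-slim
# Deps layer: cached until requirements.txt changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# App layer: rebuilt on any source change, but small.
COPY . /app
RUN pip install --no-cache-dir /app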
e
Internally, Pants is building PEXes all along the way. The genesis is historical but the need is modern: the current sandboxing scheme for running processes in Pants rules is poor at materializing 1000s of files performantly. Now, using a single PEX zip file has its problems too; so all internal PEXes are `--layout packed` to strike a balance between cache hits for unchanged portions of a venv and sandbox materialization performance. All of this also applies when remote caching is turned on.
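For the curious, the same option is available on the `pex` CLI directly; a quick sketch (the dependency choice is arbitrary):
Copy code
# Packed layout writes a directory rather than a single zip: the bootstrap
# and each wheel live in separate zip files, so unchanged deps stay
# byte-identical (and cache-friendly) across rebuilds.
$ pex requests -o app.pex --layout packed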
In other words, we're trying to deliver the best end-to-end performance of testing and building your Python code, all the way through getting it set up in a container image. We may not have everything optimized best yet; it's tricky to speed up all phases, and a given change might make some steps 10% faster but other steps 20% slower.
If it's not clear, the self-imposed constraint Pants is under here is executing all processes in a hermetic sandbox (that can be remoted if needed). This requires saving and copying all files needed. It's that part that's hard to do performantly, but it's critical for reproducibility.
The optimization from the Pex side is here: https://github.com/pantsbuild/pex/pull/1634 I'll be getting out a Pex release today and upgrading Pants.
❤️ 1
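Assuming that optimization landed as proposed in the ticket above (a scope option on the `venv` tool for deps-only vs. sources-only installs), the two-step install from earlier could look roughly like this; the paths and Python version are assumptions:
Copy code
FROM python:3.9-slim
COPY app.pex /app.pex
# Deps layer first: cached as long as 3rd-party requirements are unchanged.
RUN PEX_TOOLS=1 python3.9 /app.pex venv --scope=deps --compile /app
# Source layer second: rebuilt on every code change, but tiny.
RUN PEX_TOOLS=1 python3.9 /app.pex venv --scope=srcs --compile /app
ENTRYPOINT ["/app/pex"]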