# general
g
Is there any way to make publishing of docker containers parallelized? It seems like it's happening serially. Is my perception wrong?
s
Please let me know how you solve this lol
It's definitely serial
r
s
I think one solution is to not use publish but package, introspect what was built, and then run docker push yourself
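A rough sketch of that package-then-push approach, assuming the docker_image targets load their tags into the local daemon when packaged; the target path, registry, and tag names are placeholders:
Copy code
# Build all the images locally first; pants runs this step in parallel.
pants package src/docker::

# Then push the resulting tags ourselves, a few at a time.
# The tags below are placeholders -- substitute whatever your targets produce.
printf '%s\n' \
  registry.example.com/ml/openai:latest \
  registry.example.com/ml/pytorch:latest \
  registry.example.com/ml/tensorflow:latest |
  xargs -n 1 -P 3 docker push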
g
I'm quite sure the packaging happens in parallel, just the upload doesn't.
šŸ’Æ 1
g
@gorgeous-winter-99296 did you ever find a workaround of any kind? I've been using some metaprogramming with pipeline tools (ADO and GitLab) to generate one job per docker container target. It's pretty annoying, but it works.
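For what it's worth, a minimal sketch of that kind of job generation, assuming a Pants version where the --filter-target-type global option exists (older versions spell this as the filter goal) and GitLab-style child-pipeline YAML; the ADO equivalent is the same idea with a different YAML shape:
Copy code
# Emit one publish job per docker_image target into a child-pipeline file.
pants --filter-target-type=docker_image list :: | while read -r target; do
  job="publish-$(echo "$target" | tr '/:.' '---')"
  cat <<EOF
$job:
  script:
    - pants publish $target
EOF
done > docker-publish-pipeline.yml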
g
I've never found the upload step to be a problem, especially not with proper layer design to minimise what's actually uploaded. The build step does take a while but again that's parallel...
g
I have some machine learning images that are massive.
Copy code
7.17GB
7.23GB
8.67GB
3.76GB
8.75GB
3.16GB
So the push step takes much longer than you'd think šŸ™‚
g
Hah, I'm well aware -- I also do ML :P I'll have to check my stuff now; my experience is that I'm mostly bandwidth bound, both up and down.
g
I'd love to use your OCI plugin, but I'm not familiar with OCI in general and am not sure how to translate certain things over to it. I looked at buildah and they had some interesting examples, but I couldn't figure out how to translate them to umoci. Specifically things like:
Copy code
ctr1=$(buildah from "${1:-fedora}")

## Get all updates and install our minimal httpd server
buildah run "$ctr1" -- dnf update -y
buildah run "$ctr1" -- dnf install -y lighttpd
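For what it's worth, a rough translation of that snippet into skopeo + umoci terms, under the assumption that both tools are available. umoci deliberately has no equivalent of buildah run, so the rootfs gets edited directly after unpacking; the chroot steps are illustrative and usually need extra setup:
Copy code
# Pull the base image into a local OCI layout.
skopeo copy docker://docker.io/library/fedora:latest oci:fedora-layout:latest

# Unpack it to a bundle; the filesystem lands in bundle/rootfs.
sudo umoci unpack --image fedora-layout:latest bundle

# The buildah-run steps become direct edits to the rootfs, e.g. via chroot.
# (In practice dnf may also need /proc, /dev and resolv.conf set up inside.)
sudo chroot bundle/rootfs dnf update -y
sudo chroot bundle/rootfs dnf install -y lighttpd

# Repack the changes as a new layer on top of the image.
sudo umoci repack --image fedora-layout:latest bundle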
g
Was about to say, skopeo vs docker, different registries... A lot can differ there.
Yeah, scripted builds are hard to translate, especially when they talk to unstable externals. I reluctantly support them, but they're a huge source of unnecessary rebuilds. Proper OCI is just layers of files...
g
It makes it hard when you don't know the exact list of outputs, e.g. apt install nginx
I am guessing OCI is more like docker's COPY --link, but for 100% of the layers?
g
Yes, I think so, at a glance. I think dumber... Each layer is just a tar, so the final rootfs is just those tars unpacked over each other.
šŸ’Æ 1
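That intuition in shell form, with placeholder layer names and glossing over the details:
Copy code
# Mental model only: the final rootfs is each layer tar extracted in order
# over the same directory. (Real runtimes also honour OCI whiteout files,
# which encode deletions, and verify layer digests.)
mkdir -p rootfs
for layer in layer1.tar layer2.tar layer3.tar; do
  tar -C rootfs -xf "$layer"
done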
I haven't used docker itself for many years now, so a bit out of the loop there.
šŸ‘ 1
I just went through and checked our CI logs from the last few days... Our biggest regularly built image is ~6 GB. With caches on CI etc., the whole image gets built in ~3 minutes, with publishing taking ~1 minute (mostly 2-3 layers changing, ~2 GB). The upload speed for whatever changes in that single image seems to be around 400-500 Mbit/s -- I'm not actually storing exact logs for what layers get uploaded, only timestamps, so I'm eyeballing that part from manifests. Do you have any comparable figures? With your current hack, how big is your gain from uploading in parallel? Not trying to discredit your claims btw, just trying to understand the case. If I had as many big images, uploading at full speed in parallel would need AR to accept a few Gbit/s... which seems doubtful. Our builds use this as the baseline: https://www.pantsbuild.org/blog/2022/08/02/optimizing-python-docker-deploys-using-pants
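One way to eyeball which layers actually get uploaded, without keeping upload logs, is to read the layer sizes out of the registry manifest; a sketch assuming skopeo and jq are available, with a placeholder image path:
Copy code
# Dump layer digests and sizes straight from the registry's manifest,
# e.g. to compare which layers changed between two builds.
# Assumes a single-arch image; for a manifest list, pick a platform first.
skopeo inspect --raw docker://registry.example.com/ml/openai:latest \
  | jq -r '.layers[] | "\(.size)\t\(.digest)"'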
g
@gorgeous-winter-99296 with my hack it goes from 23 minutes total (when the 3 ML images are in a single pants publish) to:
• 16 minutes (openai -- the biggest image)
• 11 minutes (pytorch)
• 1 minute (tensorflow)
g
So 16 minutes from start to finish? I.e., all start together and last to finish is openai after 16?
āœ… 1
g
For some reason the PEX files keep being built, even when kicking off a CI pipeline off the same git sha. I don't know why or how to diagnose it. I'm using a remote cache.
g
Are you sure they get built, or is it just printing that they get built? The log output is the same IME, so using the metrics to see the cache hit rate or comparing timestamps is more helpful.
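On the metrics point: Pants can log its own cache counters at the end of a run, which is less ambiguous than reading log lines; a sketch assuming the --stats-log option available in recent releases (counter and option names may vary slightly by version):
Copy code
# Print Pants' internal counters at the end of the run and keep only the
# remote-cache ones.
pants --stats-log package :: 2>&1 | grep remote_cache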
g
I thought a Starting followed by a Canceled meant it found it in the cache.
Copy code
19:49:35.59 [INFO] Starting: Building 30 requirements for apps.app1/binary-deps.pex from the apps/alt/pants.lock resolve: awscli<2.0.0,>=1.32, boto3<2.0.0,>=1.34.57, click<9.0.0,>=8, gitpython<4.0.0,>=3.1.41, netaddr==0.8.0, parve... (589 characters truncated)
19:49:36.34 [INFO] Canceled: Building 30 requirements for apps.app1/binary-deps.pex from the apps/alt/pants.lock resolve: awscli<2.0.0,>=1.32, boto3<2.0.0,>=1.34.57, click<9.0.0,>=8, gitpython<4.0.0,>=3.1.41, netaddr==0.8.0, parve... (589 characters truncated)
So when I say "it's not found in the cache" -- I really mean that I never observe a Canceled message for that PEX.
g
Ah, yes, sorry. But it might also be speculatively built before you get a response from the cache, in which case it'd complete while technically being a cache hit. Though for a torch build that's unlikely.
g
@gorgeous-winter-99296 in the debug logs I see this
Copy code
remote cache miss for...
I don't know if it's because the artifact caching is failing to upload (for some reason) or if it's something else. I'm running it again now to compare the output of two runs.
I'm also using https://github.com/Rantanen/proxide to try to debug grpc messages flowing back and forth.
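A lower-tech way to compare the two runs than a gRPC proxy: capture just the cache lines from the debug logs and diff them, stripping the leading timestamps so only real differences show (the target spec and file names here are placeholders):
Copy code
# Run twice from the same sha and keep only the cache-related lines.
pants -ldebug package apps:: 2>&1 | grep -i 'remote cache' | sed 's/^[0-9:.]* //' > run1.log
pants -ldebug package apps:: 2>&1 | grep -i 'remote cache' | sed 's/^[0-9:.]* //' > run2.log
diff run1.log run2.log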
g
Ah, šŸ‘ Then we know that. Interesting. https://github.com/pantsbuild/pants/issues/12203 has some details. In our case I diagnosed it as being related to pyenv/interpreters being in different locations.
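If the same interpreter-location issue is in play here, one mitigation is pinning where Pants looks for interpreters so every agent ends up with identical paths; a sketch under the assumption of a recent Pants version, with a placeholder interpreter path:
Copy code
# Pin the interpreter search path so all CI agents resolve the same location.
# [python-bootstrap].search_path is the recent option name; older versions
# use [python-setup].interpreter_search_paths instead.
cat >> pants.toml <<'EOF'
[python-bootstrap]
search_path = ["/usr/bin/python3.11"]
EOF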
g
ah... my CI environment is hacked to hell and we have 24 different users on a single VM, each running their own CI build agent. I'm guessing that's why the cache hit rate is so low.
Did you ever figure this out? In other words, do you have a known workaround to improve cache hit rates?
g
Using persistent pantsd and build directories we now get good cache hits within CI, but users gain nothing.
g
Can you give me the absolute paths for the directories you cached? I want to avoid misinterpreting what you're saying.
For pantsd, do you mean ${repo_root}/.pants.d/?
g
We literally reuse the checkout directory, and the container running CI stays alive and services other jobs, eventually picking up another pants run. Sometimes pantsd has died for various reasons, but we make no effort to kill it, and often it's alive.
g
interesting, ok.
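For concreteness on the path question: with stock Pants config the relevant local state lives in the locations below, which is what you'd persist between jobs if you can't keep the whole checkout and container around (a sketch assuming the defaults for local_store_dir, named_caches_dir, and pants_workdir):
Copy code
# Show the size of the default local caches and per-repo workdir.
du -sh ~/.cache/pants/lmdb_store    # local process/result cache -- the big one
du -sh ~/.cache/pants/named_caches  # pip/PEX wheel caches and other tool caches
du -sh .pants.d                     # per-repo workdir and pantsd metadata (run from the repo root)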
g
Ok, I'm really heading off soon~ but I realized I answered the question about remotes very differently a few weeks back. I'm seeing very different cache hit rates for different commands. We run a pants package equivalent to validate a bunch of stuff, and that works... poorly with caches. On the other end, we run a pants publish, and that has a very high cache hit rate. So overall our CI didn't speed up significantly from adding remote caches, but some steps did. Ensuring local caches and keeping pantsd alive for longer was more important for speed.
g
Thanks, @gorgeous-winter-99296 -- I appreciate the quality dialogue.