# general
g
Is there any way to make publishing of docker containers parallelized? It seems like it's happening serially. Is my perception wrong?
s
Please let me know how you solve this lol
It's definitely serial
r
s
I think one solution is to not use publish but package, introspect what was built, and then run docker push yourself
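A rough sketch of that package-then-push approach, assuming the docker_image targets load their tags into the local daemon when packaged; the target path, registry, and tag names are placeholders:
Copy code
# Build all the images locally first; pants runs this step in parallel.
pants package src/docker::

# Then push the resulting tags ourselves, a few at a time.
# The tags below are placeholders -- substitute whatever your targets produce.
printf '%s\n' \
  registry.example.com/ml/openai:latest \
  registry.example.com/ml/pytorch:latest \
  registry.example.com/ml/tensorflow:latest |
  xargs -n 1 -P 3 docker push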
g
I'm quite sure the packaging happens in parallel, just the upload doesn't.
šŸ’Æ 1
g
@gorgeous-winter-99296 did you ever find a workaround of any kind? I've been using some metaprogramming with pipeline tools (ADO and GitLab) to generate one job per docker container target. It's pretty annoying, but it works.
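For what it's worth, a minimal sketch of that kind of job generation, assuming a Pants version where the --filter-target-type global option exists (older versions spell this as the filter goal) and GitLab-style child-pipeline YAML; the ADO equivalent is the same idea with a different YAML shape:
Copy code
# Emit one publish job per docker_image target into a child-pipeline file.
pants --filter-target-type=docker_image list :: | while read -r target; do
  job="publish-$(echo "$target" | tr '/:.' '---')"
  cat <<EOF
$job:
  script:
    - pants publish $target
EOF
done > docker-publish-pipeline.yml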
g
I've never found the upload step to be a problem, especially not with proper layer design to minimise what's actually uploaded. The build step does take a while but again that's parallel...
g
I have some machine learning images that are massive.
Copy code
7.17GB
7.23GB
8.67GB
3.76GB
8.75GB
3.16GB
So the push step takes much longer than you'd think šŸ™‚
g
Hah, I'm well aware -- I also do ML :P I'll have to check my stuff now; my experience is that I'm mostly bandwidth bound, both up and down.
g
I'd love to use your OCI plugin, but I'm not familiar with OCI in general and am not sure how to translate certain things over to it. I looked at buildah and they had some interesting examples, but I couldn't figure out how to translate them to umoci. Specifically things like:
Copy code
ctr1=$(buildah from "${1:-fedora}")

## Get all updates and install our minimal httpd server
buildah run "$ctr1" -- dnf update -y
buildah run "$ctr1" -- dnf install -y lighttpd
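For what it's worth, a rough translation of that snippet into skopeo + umoci terms, under the assumption that both tools are available. umoci deliberately has no equivalent of buildah run, so the rootfs gets edited directly after unpacking; the chroot steps are illustrative and usually need extra setup:
Copy code
# Pull the base image into a local OCI layout.
skopeo copy docker://docker.io/library/fedora:latest oci:fedora-layout:latest

# Unpack it to a bundle; the filesystem lands in bundle/rootfs.
sudo umoci unpack --image fedora-layout:latest bundle

# The buildah-run steps become direct edits to the rootfs, e.g. via chroot.
# (In practice dnf may also need /proc, /dev and resolv.conf set up inside.)
sudo chroot bundle/rootfs dnf update -y
sudo chroot bundle/rootfs dnf install -y lighttpd

# Repack the changes as a new layer on top of the image.
sudo umoci repack --image fedora-layout:latest bundle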
g
Was about to say, skopeo vs docker, different registries... A lot can differ there.
Yeah, scripted builds are hard to translate, especially when they talk to unstable externals. I reluctantly support them, but they're a huge source of unnecessary rebuilds. Proper OCI is just layers of files...
g
It makes it hard when you don't know the exact list of outputs, e.g. apt install nginx
I am guessing OCI is more like docker's COPY --link, but for 100% of the layers?
g
Yes, I think so, at a glance. I think dumber... Each layer is just a tar, so the final rootfs is just those tars unpacked over each other.
šŸ’Æ 1
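That intuition in shell form, with placeholder layer names and glossing over the details:
Copy code
# Mental model only: the final rootfs is each layer tar extracted in order
# over the same directory. (Real runtimes also honour OCI whiteout files,
# which encode deletions, and verify layer digests.)
mkdir -p rootfs
for layer in layer1.tar layer2.tar layer3.tar; do
  tar -C rootfs -xf "$layer"
done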
I haven't used docker itself for many years now, so a bit out of the loop there.
šŸ‘ 1
I just went through and checked our CI logs from the last few days... Our biggest regularly built image is ~6 GB. With caches on CI etc., the whole image gets built in ~3 minutes, with publishing taking ~1 minute (mostly 2-3 layers changing, ~2 GB). The upload speed for whatever changes in that single image seems to be around 400-500 Mbit/s -- I'm not actually storing exact logs for what layers get uploaded, only timestamps, so I'm eyeballing that part from manifests. Do you have any comparable figures? With your current hack, how big is your gain from uploading in parallel? Not trying to discredit your claims btw, just trying to understand the case. If I had as many big images, uploading at full speed in parallel would need AR to accept a few Gbit/s... which seems doubtful. Our builds use this as the baseline: https://www.pantsbuild.org/blog/2022/08/02/optimizing-python-docker-deploys-using-pants
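One way to eyeball which layers actually get uploaded, without keeping upload logs, is to read the layer sizes out of the registry manifest; a sketch assuming skopeo and jq are available, with a placeholder image path:
Copy code
# Dump layer digests and sizes straight from the registry's manifest,
# e.g. to compare which layers changed between two builds.
# Assumes a single-arch image; for a manifest list, pick a platform first.
skopeo inspect --raw docker://registry.example.com/ml/openai:latest \
  | jq -r '.layers[] | "\(.size)\t\(.digest)"'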
g
@gorgeous-winter-99296 with my hack it goes from 23 minutes total (when the 3 ML images are in a single pants publish) to:
• 16 minutes (openai -- the biggest image)
• 11 minutes (pytorch)
• 1 minute (tensorflow)
g
So 16 minutes from start to finish? I.e., all start together and last to finish is openai after 16?
āœ… 1
g
For some reason the PEX files keep being built, even when kicking off a CI pipeline off the same git sha. I don't know why or how to diagnose it. I'm using a remote cache.
g
Are you sure they get built, or is it just printing that they get built? The log output is the same IME, so using the metrics to see the cache hit rate or comparing timestamps is more helpful.
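On the metrics point: Pants can log its own cache counters at the end of a run, which is less ambiguous than reading log lines; a sketch assuming the --stats-log option available in recent releases (counter and option names may vary slightly by version):
Copy code
# Print Pants' internal counters at the end of the run and keep only the
# remote-cache ones.
pants --stats-log package :: 2>&1 | grep remote_cache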
g
I thought a Starting followed by a Canceled meant it found it in the cache.
Copy code
19:49:35.59 [INFO] Starting: Building 30 requirements for apps.app1/binary-deps.pex from the apps/alt/pants.lock resolve: awscli<2.0.0,>=1.32, boto3<2.0.0,>=1.34.57, click<9.0.0,>=8, gitpython<4.0.0,>=3.1.41, netaddr==0.8.0, parve... (589 characters truncated)
19:49:36.34 [INFO] Canceled: Building 30 requirements for apps.app1/binary-deps.pex from the apps/alt/pants.lock resolve: awscli<2.0.0,>=1.32, boto3<2.0.0,>=1.34.57, click<9.0.0,>=8, gitpython<4.0.0,>=3.1.41, netaddr==0.8.0, parve... (589 characters truncated)
So when I say "it's not found in the cache" -- I really mean that I never observe a Canceled message for that PEX.
g
Ah, yes, sorry. But it might also be speculatively built before you get a response from the cache, in which case it'd complete while technically being a cache hit. Though for a torch build that's unlikely.
g
@gorgeous-winter-99296 in the debug logs I see this
Copy code
remote cache miss for...
I don't know if it's because the artifact caching is failing to upload (for some reason) or if it's something else. I'm running it again now to compare the output of two runs.
I'm also using https://github.com/Rantanen/proxide to try to debug grpc messages flowing back and forth.
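A lower-tech way to compare the two runs than a gRPC proxy: capture just the cache lines from the debug logs and diff them, stripping the leading timestamps so only real differences show (the target spec and file names here are placeholders):
Copy code
# Run twice from the same sha and keep only the cache-related lines.
pants -ldebug package apps:: 2>&1 | grep -i 'remote cache' | sed 's/^[0-9:.]* //' > run1.log
pants -ldebug package apps:: 2>&1 | grep -i 'remote cache' | sed 's/^[0-9:.]* //' > run2.log
diff run1.log run2.log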
g
Ah, šŸ‘ Then we know that. Interesting. https://github.com/pantsbuild/pants/issues/12203 has some details. In our case I diagnosed it as being related to pyenv/interpreters being in different locations.
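If the same interpreter-location issue is in play here, one mitigation is pinning where Pants looks for interpreters so every agent ends up with identical paths; a sketch under the assumption of a recent Pants version, with a placeholder interpreter path:
Copy code
# Pin the interpreter search path so all CI agents resolve the same location.
# [python-bootstrap].search_path is the recent option name; older versions
# use [python-setup].interpreter_search_paths instead.
cat >> pants.toml <<'EOF'
[python-bootstrap]
search_path = ["/usr/bin/python3.11"]
EOF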
g
ah... my CI environment is hacked to hell and we have 24 different users on a single VM, each running their own CI build agent. I'm guessing that's why the cache hit rate is so low.
Did you ever figure this out? In other words, do you have a known workaround to improve cache hit rates?
g
Using persistent pantsd and build directories we now get good cache hits within CI, but users gain nothing.
g
Can you give me the absolute paths for the directories you cached? I want to avoid misinterpreting what you're saying.
For pantsd, do you mean ${repo_root}/.pants.d/?
g
We literally reuse the checkout directory, and the container running CI stays alive and services other jobs, eventually picking up another pants run. Sometimes pantsd has died for various reasons, but we make no effort to kill it, and often it's alive.
g
interesting, ok.
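For concreteness on the path question: with stock Pants config the relevant local state lives in the locations below, which is what you'd persist between jobs if you can't keep the whole checkout and container around (a sketch assuming the defaults for local_store_dir, named_caches_dir, and pants_workdir):
Copy code
# Show the size of the default local caches and per-repo workdir.
du -sh ~/.cache/pants/lmdb_store    # local process/result cache -- the big one
du -sh ~/.cache/pants/named_caches  # pip/PEX wheel caches and other tool caches
du -sh .pants.d                     # per-repo workdir and pantsd metadata (run from the repo root)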
g
Ok, I'm really heading off soon~ but I realized I answered the question about remotes very differently a few weeks back. I'm seeing very different cache hit rates for different commands. We run a pants package equivalent to validate a bunch of stuff, and that works... poorly with caches. On the other end, we run a pants publish, and that has a very high cache hit rate. So overall our CI didn't speed up significantly from adding remote caches, but some steps did. Ensuring local caches and keeping pantsd alive for longer was more important for speed.
g
Thanks, @gorgeous-winter-99296 -- I appreciate the quality dialogue.