# general
p
I have a large package that I depend on (pytorch CUDA) and my CI keeps failing due to:
There was 1 error downloading required artifacts:
1. torch 1.12.1+cu116 from https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
    Executing /home/companion/.cache/pants/named_caches/pex_root/venvs/ed3764586d1d1a9f27f2d6aa23d6ce7741735e1a/35dbbcf0d165ce1faf5f10df88cd6c8a72fc9216/bin/python -sE /home/companion/.cache/pants/named_caches/pex_root/venvs/ed3764586d1d1a9f27f2d6aa23d6ce7741735e1a/35dbbcf0d165ce1faf5f10df88cd6c8a72fc9216/pex --disable-pip-version-check --no-python-version-warning --exists-action a --no-input --use-deprecated legacy-resolver --isolated -q --cache-dir /home/companion/.cache/pants/named_caches/pex_root/pip_cache --log /tmp/pants-sandbox-TxaSjj/.tmp/pex-pip-log.tsa4zqe_/pip.log download --dest /home/companion/.cache/pants/named_caches/pex_root/downloads/b6bc31244aa2818929fbb30c483c221df471e9d856e805c5a1ff72b131ae9e7b.9bdef5f1ebe741c6adc310711d52f843 --no-deps https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl --index-url https://pypi.org/simple --extra-index-url https://download.pytorch.org/whl/cu116 --retries 5 --timeout 15 failed with -9
I think it's just timing out as the download is big. Is there a way to change the --timeout? Looked through ./pants help-advanced python and didn't see any options that seemed relevant.
Or better, can we speed this up? A pip install of the dependency takes about 2.5 minutes including downloading, while the Pants version takes > 10 minutes on my local laptop.
e.g. currently it reads
210.81s Building 3 requirements for requirements.pex from the 3rdparty/python/default.lock resolve: pytest, torch, torchvision
and it's still chugging along...
e
No comment on the speed, that's a longer conversation. On the -9 in the OP though, that's the Linux OOM killer killing the Pip download process. You're memory starved.
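For example, on a node you can shell into, the kernel log should show the kill. A minimal sketch (the exact log wording varies by kernel/distro, so treat the grep patterns as illustrative):
# Check the kernel ring buffer for evidence of the OOM killer.
dmesg | grep -iE 'out of memory|oom-kill|killed process'
# On systemd-based images the same info is usually in the journal:
journalctl -k | grep -i 'oom'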
p
Ah! Thanks @enough-analyst-54434. Is Pex holding the entire ~2GB download in memory instead of streaming it out to disk?
e
Pex definitely is not. Pip, I have no clue. Pip does most of the work here
👍 1
p
Interesting. I wonder why it gets OOM killed then.
e
How much RAM does the node have vs. how much is Pants configured to use? I'm assuming it's Pants running the Pex process, that the Pants process itself is ~the only thing of importance, and that there are no contending processes running on the node.
If pantsd is in play, that's mainly controlled here: https://www.pantsbuild.org/docs/reference-global#pantsd_max_memory_usage, but if the node is a CI node, this is probably most useful: https://www.pantsbuild.org/docs/using-pants-in-ci#tuning-resource-consumption-advanced
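A rough sketch of what that usually looks like in CI (values are illustrative; check ./pants help-advanced global for the exact option names in your Pants version):
# Illustrative CI settings only -- tune the numbers for your runner.
export PANTS_PANTSD=false                           # pantsd is typically disabled in CI
export PANTS_PROCESS_EXECUTION_LOCAL_PARALLELISM=2  # fewer concurrent subprocesses -> lower peak memory
export PANTS_RULE_THREADS_CORE=2                    # shrink the engine's own thread pool
# If you keep pantsd, cap its memory instead, e.g. PANTS_PANTSD_MAX_MEMORY_USAGE=2GiB
# (older Pants versions expect a raw byte count here).
./pants test ::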
p
Good question! The default runner, which is what we were using, has 3.75GB RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html). That one dies with OOM; the medium runner, which has 8GB, does not. I hadn't configured pantsd memory usage, but it looks like it's set to 1GB by default (https://www.pantsbuild.org/docs/reference-global#section-pantsd-max-memory-usage), which seems like it should be fine. Indeed, Pants is the main user of memory, especially at this stage where it's busy gathering dependencies rather than running tests, etc.
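A quick way to sanity-check that on the runner itself (just a sketch; the help output format differs a bit across Pants versions):
free -h                                                              # how much RAM the runner actually has
./pants help-advanced global | grep -A 4 'pantsd-max-memory-usage'   # what pantsd is allowed to use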
e
Do you use remote caching?
p
We don't currently. It's on my list to enable it, but I don't think that really solves the issue. I had caching set up with a prior Pants build, and the cache fairly quickly gets too big and has to be deleted, or it actually slows things down (per the advice here: https://www.pantsbuild.org/docs/using-pants-in-ci), which means we have to re-pull our 3rd-party dependencies every few days when we nuke the cache. So it'd make the build a little faster and perhaps more reliable, but if we can't pull that dependency at all, then the build wouldn't work for very long.
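Roughly the pattern from that doc, for reference (the 2GB threshold and cache path here are illustrative, not what we literally run):
# Delete the local Pants cache between CI runs once it grows past a size threshold.
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/pants"
MAX_KB=$((2 * 1024 * 1024))   # 2GB, expressed in KB for du -sk
size_kb=$(du -sk "$CACHE_DIR" 2>/dev/null | cut -f1)
if [ "${size_kb:-0}" -gt "$MAX_KB" ]; then
  echo "Pants cache is ${size_kb}KB; nuking it."
  rm -rf "$CACHE_DIR"
fi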
e
Ok, just hunting down where this might be getting buffered up.