# general
p
I have a large package that I depend on (pytorch CUDA) and my CI keeps failing due to:
There was 1 error downloading required artifacts:
1. torch 1.12.1+cu116 from https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
    Executing /home/companion/.cache/pants/named_caches/pex_root/venvs/ed3764586d1d1a9f27f2d6aa23d6ce7741735e1a/35dbbcf0d165ce1faf5f10df88cd6c8a72fc9216/bin/python -sE /home/companion/.cache/pants/named_caches/pex_root/venvs/ed3764586d1d1a9f27f2d6aa23d6ce7741735e1a/35dbbcf0d165ce1faf5f10df88cd6c8a72fc9216/pex --disable-pip-version-check --no-python-version-warning --exists-action a --no-input --use-deprecated legacy-resolver --isolated -q --cache-dir /home/companion/.cache/pants/named_caches/pex_root/pip_cache --log /tmp/pants-sandbox-TxaSjj/.tmp/pex-pip-log.tsa4zqe_/pip.log download --dest /home/companion/.cache/pants/named_caches/pex_root/downloads/b6bc31244aa2818929fbb30c483c221df471e9d856e805c5a1ff72b131ae9e7b.9bdef5f1ebe741c6adc310711d52f843 --no-deps https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl --index-url https://pypi.org/simple --extra-index-url https://download.pytorch.org/whl/cu116 --retries 5 --timeout 15 failed with -9
I think it's just timing out as the download is big. Is there a way to change the --timeout? Looked through ./pants help-advanced python and didn't see any options that seemed relevant.
Or better, can we speed this up? A pip install of the dependency takes about 2.5 minutes including downloading, while the Pants version takes > 10 minutes on my local laptop.
e.g. currently it reads
210.81s Building 3 requirements for requirements.pex from the 3rdparty/python/default.lock resolve: pytest, torch, torchvision
and it's still chugging along...
e
No comment on the speed, that's a longer conversation. On the -9 in the OP though, that's the Linux OOM killer killing the Pip download process. You're memory starved.
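For example, on a node you can shell into, the kernel log should show the kill. A minimal sketch (the exact log wording varies by kernel/distro, so treat the grep patterns as illustrative):
# Check the kernel ring buffer for evidence of the OOM killer.
dmesg | grep -iE 'out of memory|oom-kill|killed process'
# On systemd-based images the same info is usually in the journal:
journalctl -k | grep -i 'oom'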
p
Ah! Thanks @enough-analyst-54434. Is Pex holding the entire ~2GB download in memory instead of streaming it out to disk?
e
Pex definitely is not. Pip, I have no clue. Pip does most of the work here
👍 1
p
Interesting. I wonder why it gets OOM killed then.
e
How much RAM does the node have vs. how much is Pants configured to use? I'm assuming it's Pants running the Pex process, that the Pants process itself is ~the only thing of importance, and that there are no contending processes running on the node.
If pantsd is in play, that's mainly controlled here: https://www.pantsbuild.org/docs/reference-global#pantsd_max_memory_usage, but if the node is a CI node, this is probably most useful: https://www.pantsbuild.org/docs/using-pants-in-ci#tuning-resource-consumption-advanced
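A rough sketch of what that usually looks like in CI (values are illustrative; check ./pants help-advanced global for the exact option names in your Pants version):
# Illustrative CI settings only -- tune the numbers for your runner.
export PANTS_PANTSD=false                           # pantsd is typically disabled in CI
export PANTS_PROCESS_EXECUTION_LOCAL_PARALLELISM=2  # fewer concurrent subprocesses -> lower peak memory
export PANTS_RULE_THREADS_CORE=2                    # shrink the engine's own thread pool
# If you keep pantsd, cap its memory instead, e.g. PANTS_PANTSD_MAX_MEMORY_USAGE=2GiB
# (older Pants versions expect a raw byte count here).
./pants test ::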
p
Good question! The default runner, which is what we were using, has 3.75GB RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html). That one dies with OOM; the medium runner, which has 8GB, does not. I hadn't configured pantsd memory usage, but it looks like it's set to 1GB by default (https://www.pantsbuild.org/docs/reference-global#section-pantsd-max-memory-usage), which seems like it should be fine. Indeed, Pants is the main user of memory, especially at this stage where it's busy gathering dependencies rather than running tests, etc.
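A quick way to sanity-check that on the runner itself (just a sketch; the help output format differs a bit across Pants versions):
free -h                                                              # how much RAM the runner actually has
./pants help-advanced global | grep -A 4 'pantsd-max-memory-usage'   # what pantsd is allowed to use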
e
Do you use remote caching?
p
We don't currently. It's on my list to enable it, but I don't think that really solves the issue. I had caching set up with a prior Pants build, and the cache fairly quickly gets too big and has to be deleted, or it actually slows things down (per the advice here: https://www.pantsbuild.org/docs/using-pants-in-ci), which means we have to re-pull our 3rd-party dependencies every few days when we nuke the cache. So it'd make the build a little faster and perhaps more reliable, but if we can't pull that dependency at all, then the build wouldn't work for very long.
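Roughly the pattern from that doc, for reference (the 2GB threshold and cache path here are illustrative, not what we literally run):
# Delete the local Pants cache between CI runs once it grows past a size threshold.
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/pants"
MAX_KB=$((2 * 1024 * 1024))   # 2GB, expressed in KB for du -sk
size_kb=$(du -sk "$CACHE_DIR" 2>/dev/null | cut -f1)
if [ "${size_kb:-0}" -gt "$MAX_KB" ]; then
  echo "Pants cache is ${size_kb}KB; nuking it."
  rm -rf "$CACHE_DIR"
fi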
e
Ok, just hunting down where this might be getting buffered up.