gifted-refrigerator-82216
09/14/2023, 8:16 AM
I ran pants test :: and inspected the sandboxes via:
$> rm -rf /tmp/pants-sandbox-*
$> ./pants --no-local-cache --keep-sandboxes=always test packages/<mypackage>/tests/:tests
$> find /tmp/pants-sandbox-* -type f -size +100M -printf "%n %s %f\n" | sort | uniq --count
The last command finds files >100 MB and prints their hard-link count and file size in bytes along with the file name.
A typical output contains lines like these:
22 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
48 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
22 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
The first column is added by uniq --count. The output means that e.g. torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl is found 22 times under /tmp/pants-sandbox-*, while each copy has a hard-link count of 1.
That means all 22 files are separate physical copies that had to be written individually, which explains the high disk write load.
The listed files are mostly located in /tmp/pants-sandbox-*/{pytest,requirements}.pex/.deps.
Given the large number of test modules and the size of their dependencies, this behavior easily saturates the write bandwidth of any SSD we could throw at the problem.
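A rough way to put a number on that, assuming GNU find and awk, is to sum the sizes of all those large files across the sandboxes:
$> find /tmp/pants-sandbox-* -type f -size +100M -printf "%s\n" | awk '{ total += $1 } END { printf "%.1f GB of large files across all sandboxes\n", total / 1e9 }'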
Is there a way to avoid this issue?
BTW, I also wonder why the wheels need to be in the sandboxes at all, as I assumed that pants caches venvs in which those dependencies are already pre-installed.
Thank you very much for any advice and insight!
broad-processor-92400
09/14/2023, 8:26 AM
gifted-refrigerator-82216
09/14/2023, 8:27 AM
broad-processor-92400
09/14/2023, 8:28 AM
happy-kitchen-89482
09/14/2023, 9:23 AM
happy-kitchen-89482
09/14/2023, 9:24 AM
happy-kitchen-89482
09/14/2023, 9:25 AM
happy-kitchen-89482
09/14/2023, 9:25 AM
gifted-refrigerator-82216
09/14/2023, 9:58 AM
gifted-refrigerator-82216
09/14/2023, 10:00 AM
I also tried --python-run-against-entire-lockfile but the problem only got worse:
54 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
54 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
54 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
bitter-ability-32190
09/14/2023, 11:26 AM
gifted-refrigerator-82216
09/14/2023, 11:28 AM
"If you're in the sandbox, try making a hardlink to the cache"
How do I do that? I'm not that familiar with pants internals.
bitter-ability-32190
09/14/2023, 11:40 AM
ln (file in cache) (local path)
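To see whether the link actually took effect, the hard-link count can be checked before and after, e.g. with GNU stat (same placeholders as in the ln suggestion above; this is an illustration, not something pants does for you):
$> stat -c '%h %n' (local path)        # a plain copy shows a hard-link count of 1
$> ln -f (file in cache) (local path)  # -f replaces the copy; fails with a cross-device error if cache and sandbox are on different filesystems
$> stat -c '%h %n' (local path)        # shows 2 once the file is hard-linked to the cache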
gifted-refrigerator-82216
09/14/2023, 3:21 PM
Take for example /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl.
I find the same file in the cache at:
~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
~/.cache/pants/named_caches/pex_root/installed_wheel_zips/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
Do you suggest replacing the file in the sandbox with a hard link to the cache? I.e.:
ln ~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
(or the other cache file, I don't know)
I'm not sure I understand the goal. That sandbox was populated by pants and I only preserved it to examine what's going on. What would be the point of a manual hard link?
I'm looking for a way to prevent pants from copying identical wheels many times into its sandboxes. If pants used a hard link here, that would solve my problem.
bitter-ability-32190
09/14/2023, 3:24 PM
gifted-refrigerator-82216
09/14/2023, 3:28 PM
I get ln: failed to create hard link. I'll review and adapt my setup to get this fixed.
Thank you so far!
bitter-ability-32190
09/14/2023, 3:29 PM
bitter-ability-32190
09/14/2023, 3:29 PM
gifted-refrigerator-82216
09/14/2023, 4:24 PM
I moved the cache to /tmp (via XDG_CACHE_HOME=/tmp ./pants ..., which is a bad idea but proves the solution) and suddenly I see:
1 2 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
21 22 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
21 22 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
1 2 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
42 43 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
6 7 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
So all instances have hard-links attached! Yay, problem solved! 🥳
Thanks a lot @bitter-ability-32190! =)
(Hindsight: I was running pants in a docker container, with ~/.cache/ being shared from the host via docker run --volume ... and /tmp being inside the container. On the host it's all on the same SSD, but from the container's perspective these are different filesystems.)
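One quick way to confirm that from inside the container, assuming GNU coreutils, is to compare the device IDs of the cache and the sandbox location:
$> stat -c '%d %n' ~/.cache/pants/named_caches /tmp
Different device numbers mean different filesystems, and hard links across filesystems are impossible.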
happy-kitchen-89482
09/15/2023, 11:23 AM
gifted-refrigerator-82216
09/15/2023, 3:39 PM
The pants cache lives in ~/.cache/pants while the sandboxes are in /tmp, and it's not uncommon for both to be on different filesystems, especially on CI.
And if you step into this trap, you don't really notice except for increased write load and disk consumption (if you happen to check that). The adverse effect is amplified by the number of tests and the size of their dependencies. In our case, ensuring that hard links are possible reduced the test time by a factor of 4-8 (depending on disk speed), and the SSD bandwidth is no longer saturated. So for us the effect is huge.
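For reference, a sketch of a less hacky setup than XDG_CACHE_HOME=/tmp, assuming a Pants version that supports --local-execution-root-dir (the path is just an example): point the sandbox root at a directory on the same filesystem as the cache, e.g.
$> ./pants --local-execution-root-dir="$HOME/.cache/pants/sandboxes" test ::
In a container this means mounting the sandbox root from the same volume as ~/.cache/pants; the option can also be set permanently in pants.toml under [GLOBAL].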
For these reasons I'd propose mentioning this issue on the troubleshooting page, and maybe even adding a runtime info message with a link to it when pants is unable to use hard links while populating its sandboxes.
bitter-ability-32190
09/15/2023, 4:15 PM
enough-analyst-54434
09/16/2023, 10:41 AM
gifted-refrigerator-82216
09/18/2023, 8:29 AM
enough-analyst-54434
09/18/2023, 9:12 AM
enough-analyst-54434
09/18/2023, 9:12 AM
enough-analyst-54434
09/18/2023, 9:14 AM
enough-analyst-54434
09/18/2023, 9:18 AM
enough-analyst-54434
09/18/2023, 9:20 AM
enough-analyst-54434
09/18/2023, 9:30 AM