g
Hi all, we're using pants for python code, which has lots of dependencies, some of them very large (e.g. tensorflow, torch). I observed extremely high disk write load (several GB/s) when running e.g. pants test ::, and inspected sandboxes via:
    $> rm -rf /tmp/pants-sandbox-*
    $> ./pants --no-local-cache --keep-sandboxes=always test packages/<mypackage>/tests/:tests
    $> find /tmp/pants-sandbox-* -type f -size +100M -printf "%n %s %f\n" | sort | uniq --count
The last command finds files >100 MB and prints their hard-link count and file size in bytes along with the file name. A typical output contains lines like these:
     22 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     48 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
     22 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
The first column is added by uniq --count. The output means that e.g. torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl is found 22 times under /tmp/pants-sandbox-* while each file has a hard-link count of 1. That means all 22 files are individual files which had to be written separately, which explains the high disk write load. The listed files are mostly located in /tmp/pants-sandbox-*/{pytest,requirements}.pex/.deps. Given the large number of test modules and the size of their dependencies, this behavior easily saturates the write bandwidth of any SSD we could throw at the problem. Is there a way to avoid this issue? BTW, I also wonder why the wheels need to be in sandboxes at all, as I assumed that pants caches venvs where those dependencies are already pre-installed. Thank you very much for advice and insight!
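For scale, here is a rough way to estimate how many bytes were materialized as individual copies (a sketch assuming GNU findutils and awk; every sandbox file with a hard-link count of 1 is its own copy on disk):

    # Sum the sizes of all sandbox files that are not hard-linked anywhere else.
    find /tmp/pants-sandbox-* -type f -links 1 -printf "%s\n" \
      | awk '{ sum += $1 } END { printf "%.1f GiB\n", sum / 2^30 }'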
b
What version of pants are you using? 2.17 (just released) theoretically has optimisations for hard-linking files like that.
g
Oops, we're using 2.17 🙂
b
Ah. Maybe @bitter-ability-32190 can answer if this is expected behaviour
h
Pants does cache venvs, but if you're running in CI that cache may not be preserved across runs (depending on your CI setup) unless you take steps for it to be
You can try using this option https://www.pantsbuild.org/docs/reference-python#run_against_entire_lockfile to avoid the lockfile subsetting, and see if that helps
The downside is that you're running all tests against all requirements, not just the subset each test actually needs, so you'll invalidate tests more frequently than is strictly necessary
but that may be a good tradeoff, YMMV
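In pants.toml that would look roughly like this (a sketch against the 2.17 option name):

    # pants.toml
    [python]
    # Run every test against one repo-wide requirements PEX instead of a
    # per-test subset; fewer PEX builds, but coarser cache invalidation.
    run_against_entire_lockfile = true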
g
I ran the experiment above locally; all caches are preserved and everything happens on the same filesystem, i.e. hard links are possible.
I tried --python-run-against-entire-lockfile but the problem only got worse:
     54 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     54 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
     54 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
b
Are your cache and sandbox root on the same filesystem? If you're in the sandbox, try making a hardlink to the cache
g
Yes, same filesystem.
> If you're in the sandbox, try making a hardlink to the cache
How do I do that? I'm not that familiar with pants internals..
b
That's a Unix command:
    ln (file in cache) (local path)
g
I'm not sure I understand you correctly. Let's take e.g. /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl. I find the same file in the cache at:
    ~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
    ~/.cache/pants/named_caches/pex_root/installed_wheel_zips/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
Do you suggest replacing the file in the sandbox with a hard link to the cache? I.e.:
    ln ~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
(or the other cache file, I don't know.) I'm not sure I understand the goal. That sandbox was populated by pants and I only preserved it to examine what's going on. What would be the point of a manual hard link? I'm looking for a way to prevent pants from copying identical wheels many times into its sandboxes. If pants used a hard link here, that would solve my problem.
b
Sorry, this is a test to see if the hardlink is even possible. E.g. testing the "are these on the same filesystem" answer 🙂
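For reference, a quick way to check that up front (assuming GNU coreutils stat and the default cache location):

    # Hard links only work within a single filesystem: compare the device IDs
    # and mount points of the cache and the sandbox root.
    stat -c '%d %m %n' ~/.cache/pants/named_caches /tmp
    # Two different device IDs mean ln will fail with EXDEV
    # ("Invalid cross-device link").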
g
Ooh right 😅 And rightfully so, because unexpectedly I get ln: failed to create hard link. I'll review and adapt my setup to get this fixed. Thank you so far!
b
😄 Glad we got that solved
And hope you enjoy those sweet sweet hard links 😉
g
Ok, I made sure that the cache is in /tmp (via XDG_CACHE_HOME=/tmp ./pants ..., which is a bad idea but proves the solution) and suddenly I see:
      1 2 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     21 22 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     21 22 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
      1 2 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
     42 43 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
      6 7 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
So all instances have hard links attached! Yay, problem solved! 🥳 Thanks a lot @bitter-ability-32190! =) (Hindsight: I was running pants in a docker container, with ~/.cache/ being shared from the host via docker run --volume ... and /tmp living inside the container. On the host it's all on the same SSD, but from the container's perspective those are different filesystems.)
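For anyone with a similar setup, one way to avoid the split is to bind-mount both the cache and the sandbox root from the same host filesystem (a sketch; the image name and host paths are illustrative):

    # Both mounts come from the same host filesystem, so hard links between
    # cache and sandboxes keep working inside the container.
    docker run \
      --volume "$HOME/.cache/pants:/root/.cache/pants" \
      --volume "$HOME/.cache/pants-tmp:/tmp" \
      my-ci-image ./pants test ::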
✅ 1
👆 1
h
That's a fun docker gotcha for people to cut themselves on :)
g
To conclude, I'd like to highlight the subtlety of this issue. It's easy to step into: by default the cache is in ~/.cache/pants while sandboxes are in /tmp, and it's not uncommon for the two to be on different filesystems, especially in CI. And if you do step into this trap, you don't really notice, except for increased write load and disk consumption (if you happen to check). The adverse effect is amplified by the number of tests and the size of their dependencies. In our case, ensuring that hard links are possible reduced test time by a factor of 4-8 (depending on disk speed), and SSD bandwidth is no longer saturated. So for us the effect is huge. For these reasons I'd propose mentioning this issue on the troubleshooting page, and maybe even considering a runtime info message, linking to that issue, for when pants is unable to use hard links while populating its sandboxes.
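For completeness, one way to avoid the trap entirely in config (a sketch, assuming the default cache location and Pants' %(homedir)s config interpolation) is to move the sandbox root next to the cache:

    # pants.toml
    [GLOBAL]
    # Create sandboxes on the same filesystem as ~/.cache/pants so that
    # wheels can be hard-linked into them instead of copied.
    local_execution_root_dir = "%(homedir)s/.cache/pants/tmp"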
b
Our docs live in the repo if you wanna make a PR 🙂 Also, the hardlinks are a veeeeery recent change (introduced in 2.17.0!). What you describe isn't uncommon, though, and I think we do want to find ways to get the hardlinking improvement into more hands. So think of this as a stepping stone on a longer path, and please do share ideas for future stones (like the ones you just suggested 🙂)
e
@happy-kitchen-89482 it's more than a fun docker gotcha; I think it's basically only a non-gotcha on Mac. Having /tmp on tmpfs is very common in modern vanilla consumer Linux.
g
One last question that I still don't understand: why are wheels copied into a sandbox at all? I was assuming that pants caches venvs where those dependencies are already pre-installed, so there would be no need to copy the wheels.
e
The venv may or may not be cached. Instead of "try, detect a special sort of failure (no venv), add the wheels to the sandbox, then retry", the more generic approach of "populate the sandbox with all needed inputs, then run" is taken. This works robustly across supported languages and processes that may or may not use caches.
✅ 1
At the Pants engine level, Python is really not a thing. It's all just processes.
Maybe more clearly: the venvs are not cached by Pants; they are cached by Pex, which is an opaque subprocess Pants runs.
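Concretely, with default settings you can see the Pex-maintained venv cache under the named caches dir Pants hands to it (path is the default layout):

    # Pex keeps its installed venvs here; Pants only provides the directory.
    ls ~/.cache/pants/named_caches/pex_root/venvs/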
What would be useful here (Blaze had it) is a sandbox populated via a lazy filesystem that only materializes files when they are actually read.
Pants has never found the time to invest in technologies like this, though; they are ultimately pretty platform-specific. There is an unused buildfs FUSE filesystem in the codebase, so some of the work is there for a poor man's lazy fs.
But really, Pants even allowing you to have sandboxes and filestores on different filesystems at all probably doesn't make sense. It seems like a footgun degree of option freedom / bad defaults that Pants should just remove.