g
Hi all, we're using pants for python code, which has lots of dependencies, some of them very large (e.g. tensorflow, torch). I observed extremely high disk write load (several GB/s) when running e.g. pants test ::, and inspected sandboxes via:
    $> rm -rf /tmp/pants-sandbox-*
    $> ./pants --no-local-cache --keep-sandboxes=always test packages/<mypackage>/tests/:tests
    $> find /tmp/pants-sandbox-* -type f -size +100M -printf "%n %s %f\n" | sort | uniq --count
The last command finds files >100 MB and prints their hard-link count and file size in bytes along with the file name. A typical output contains lines like these:
     22 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     48 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
     22 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
The first column is added by uniq --count. The output means that e.g. torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl is found 22 times under /tmp/pants-sandbox-* while each file has a hard-link count of 1. That means all 22 files are individual files which had to be written separately, which explains the high disk write load. The listed files are mostly located in /tmp/pants-sandbox-*/{pytest,requirements}.pex/.deps. Given the large number of test modules and the size of their dependencies, this behavior easily saturates the write bandwidth of any SSD we could throw at the problem. Is there a way to avoid this issue? BTW, I also wonder why the wheels need to be in sandboxes at all, as I assumed that pants caches venvs where those dependencies are already pre-installed. Thank you very much for advice and insight!
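For scale, here is a rough way to estimate how many bytes were materialized as individual copies (a sketch assuming GNU findutils and awk; every sandbox file with a hard-link count of 1 is its own copy on disk):

    # Sum the sizes of all sandbox files that are not hard-linked anywhere else.
    find /tmp/pants-sandbox-* -type f -links 1 -printf "%s\n" \
      | awk '{ sum += $1 } END { printf "%.1f GiB\n", sum / 2^30 }'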
b
What version of pants are you using? 2.17 (just released) theoretically has optimisations for hard-linking files like that.
g
Oops, we're using 2.17 🙂
b
Ah. Maybe @bitter-ability-32190 can answer if this is expected behaviour
h
Pants does cache venvs, but if you're running in CI that cache may not be preserved across runs (depending on your CI setup) unless you take steps for it to be
You can try using this option https://www.pantsbuild.org/docs/reference-python#run_against_entire_lockfile to avoid the lockfile subsetting, and see if that helps
The downside is that you're running all tests against all requirements, not just the subset each test actually needs, so you'll invalidate tests more frequently than is strictly necessary
but that may be a good tradeoff, YMMV
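In pants.toml that would look roughly like this (a sketch against the 2.17 option name):

    # pants.toml
    [python]
    # Run every test against one repo-wide requirements PEX instead of a
    # per-test subset; fewer PEX builds, but coarser cache invalidation.
    run_against_entire_lockfile = true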
g
I ran the experiment above locally; all caches are preserved and everything happens on the same filesystem, i.e. hard links are possible.
I tried --python-run-against-entire-lockfile but the problem only got worse:
     54 1 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     54 1 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
     54 1 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
b
Are your cache and sandbox root on the same filesystem? If you're in the sandbox, try making a hardlink to the cache
g
Yes, same filesystem.
> If you're in the sandbox, try making a hardlink to the cache
How do I do that? I'm not that familiar with pants internals..
b
That's a Unix command:
    ln (file in cache) (local path)
g
I'm not sure I understand you correctly. Let's take e.g. /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl. I find the same file in the cache at:
    ~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
    ~/.cache/pants/named_caches/pex_root/installed_wheel_zips/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
Do you suggest replacing the file in the sandbox with a hard link to the cache? I.e.:
    ln ~/.cache/pants/named_caches/pex_root/downloads/<hash>/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl /tmp/pants-sandbox-WIrUhg/pytest.pex/.deps/torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
(or the other cache file, I don't know.) I'm not sure I understand the goal. That sandbox was populated by pants and I only preserved it to examine what's going on. What would be the point of a manual hard link? I'm looking for a way to prevent pants from copying identical wheels many times into its sandboxes. If pants used a hard link here, that would solve my problem.
b
Sorry, this is a test to see if the hardlink is even possible. E.g. testing the "are these on the same filesystem" answer 🙂
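For reference, a quick way to check that up front (assuming GNU coreutils stat and the default cache location):

    # Hard links only work within a single filesystem: compare the device IDs
    # and mount points of the cache and the sandbox root.
    stat -c '%d %m %n' ~/.cache/pants/named_caches /tmp
    # Two different device IDs mean ln will fail with EXDEV
    # ("Invalid cross-device link").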
g
Ooh right 😅 And rightfully so, because unexpectedly I get ln: failed to create hard link. I'll review and adapt my setup to get this fixed. Thank you so far!
b
😄 Glad we got that solved
And hope you enjoy those sweet sweet hard links 😉
g
Ok, I made sure that the cache is in /tmp (via XDG_CACHE_HOME=/tmp ./pants ..., which is a bad idea but proves the solution) and suddenly I see:
      1 2 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     21 22 1842032500 torch-2.0.1+cu117-cp310-cp310-linux_x86_64.whl
     21 22 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
      1 2 719346326 nvidia_cudnn_cu116-8.4.0.27-py3-none-manylinux1_x86_64.whl
     42 43 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
      6 7 498017674 tensorflow-2.8.1-cp310-cp310-manylinux2010_x86_64.whl
So all instances have hard links attached! Yay, problem solved! 🥳 Thanks a lot @bitter-ability-32190! =) (Hindsight: I was running pants in a docker container, with ~/.cache/ being shared from the host via docker run --volume ... and /tmp living inside the container. On the host it's all on the same SSD, but from the container's perspective those are different filesystems.)
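For anyone with a similar setup, one way to avoid the split is to bind-mount both the cache and the sandbox root from the same host filesystem (a sketch; the image name and host paths are illustrative):

    # Both mounts come from the same host filesystem, so hard links between
    # cache and sandboxes keep working inside the container.
    docker run \
      --volume "$HOME/.cache/pants:/root/.cache/pants" \
      --volume "$HOME/.cache/pants-tmp:/tmp" \
      my-ci-image ./pants test ::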
✅ 1
👆 1
h
That's a fun docker gotcha for people to cut themselves on :)
g
To conclude, I'd like to highlight the subtlety of this issue. It's easy to step into: by default the cache is in ~/.cache/pants while sandboxes are in /tmp, and it's not uncommon for the two to be on different filesystems, especially in CI. And if you do step into this trap, you don't really notice, except for increased write load and disk consumption (if you happen to check). The adverse effect is amplified by the number of tests and the size of their dependencies. In our case, ensuring that hard links are possible reduced test time by a factor of 4-8 (depending on disk speed), and SSD bandwidth is no longer saturated. So for us the effect is huge. For these reasons I'd propose mentioning this issue on the troubleshooting page, and maybe even considering a runtime info message, linking to that issue, for when pants is unable to use hard links while populating its sandboxes.
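For completeness, one way to avoid the trap entirely in config (a sketch, assuming the default cache location and Pants' %(homedir)s config interpolation) is to move the sandbox root next to the cache:

    # pants.toml
    [GLOBAL]
    # Create sandboxes on the same filesystem as ~/.cache/pants so that
    # wheels can be hard-linked into them instead of copied.
    local_execution_root_dir = "%(homedir)s/.cache/pants/tmp"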
b
Our docs live in the repo if you wanna make a PR 🙂 Also, the hardlinks are a veeeeery recent change (introduced in 2.17.0!). What you describe isn't uncommon, though, and I think we do want to find ways to get the hardlinking improvement into more hands. So think of this as a stepping stone on a longer path, and please do share ideas for future stones (like the ones you just suggested 🙂)
e
@happy-kitchen-89482 it's more than a fun docker gotcha; I think it's basically only a non-gotcha on Mac. Having /tmp on tmpfs is very common in modern vanilla consumer Linux.
g
One last question that I still don't understand: why are wheels copied into a sandbox at all? I was assuming that pants caches venvs where those dependencies are already pre-installed, so there would be no need to copy the wheels.
e
The venv may or may not be cached. Instead of "try, detect a special sort of failure (no venv), add the wheels to the sandbox, then retry", the more generic approach of "populate the sandbox with all needed inputs, then run" is taken. This works robustly across supported languages and processes that may or may not use caches.
✅ 1
At the Pants engine level, Python is really not a thing. It's all just processes.
Maybe more clearly: the venvs are not cached by Pants; they are cached by Pex, which is an opaque subprocess Pants runs.
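Concretely, with default settings you can see the Pex-maintained venv cache under the named caches dir Pants hands to it (path is the default layout):

    # Pex keeps its installed venvs here; Pants only provides the directory.
    ls ~/.cache/pants/named_caches/pex_root/venvs/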
What would be useful here (Blaze had it) is a sandbox populated via a lazy filesystem that only materializes files when they are actually read.
Pants has never found the time to invest in technologies like this, though; they are ultimately pretty platform-specific. There is an unused buildfs FUSE filesystem in the codebase, so some of the work is there for a poor man's lazy fs.
But really, Pants even allowing you to have sandboxes and filestores on different filesystems at all probably doesn't make sense. It seems like a footgun degree of option freedom / bad defaults that Pants should just remove.