Looking at integration of pants into CI: we're usi...
# general
e
Looking at integration of pants into CI: we're using CircleCI, which uses named caches usually based on the hash of a file (so you'd say have a pip cache based on the hash of requirements.txt which is calculated when requirements.txt changes, but until requirements.txt changes, that cache doesn't appear to be rebuilt). We can get an approximation to that by manually looking at some files pants uses (can proxy that all requirements.txt files could be hashed together, say) but there are also some internal dependencies that pants sorts through while creating its caches. So my question--is there any way to extract that from pants, to tell when the cache has to be rebuilt? Or is it better to just make a best guess based on all the requirements.txt and maybe BUILD files hashed together? (sorry if that's unclear; fair amount of flailing here and GARGANTUAN pants caches; still working on the advice in https://www.pantsbuild.org/docs/using-pants-in-ci but ever since our developers started using torch any cache just gets punishingly large)
f
i know this won't solve the problem, but you can try using the CPU torch wheels from https://download.pytorch.org/whl/torch_stable.html which are smaller (although still huge by normal standards)
as far as caching goes....what are you trying to cache here? The transitive closure of your deps?
e
Both the three pants cache directories (as described in "Using Pants in CI") and the pip cache, ideally. Right now I took the simple expedient of concatenating all the requirements.txt files for a checksum, which does the trick for reasonable values of "the trick"--but the caches are indeed massive (actually something didn't work correctly for the pip cache perhaps, but it's still a work in progress).
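For illustration, the "hash all the requirements.txt and maybe BUILD files together" idea could be sketched roughly as below. This is only an approximation of when the caches might need rebuilding, not Pants's own invalidation logic; the function name combined_cache_key and the glob patterns are hypothetical choices made for this sketch.

```python
import hashlib
from pathlib import Path


def combined_cache_key(root, patterns=("requirements*.txt", "BUILD*")):
    """Return one SHA-256 hex digest over every matching file under root.

    The patterns are an assumption: adjust them to whatever files you
    treat as a proxy for "dependencies changed".
    """
    root = Path(root)
    # A set removes files matched by more than one pattern; sorting makes
    # the digest independent of filesystem traversal order.
    files = sorted(
        {p for pattern in patterns for p in root.rglob(pattern) if p.is_file()}
    )
    digest = hashlib.sha256()
    for path in files:
        # Mix in the relative path so renames and moves also change the key.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

One way to use this would be to write the digest to a file early in the job and reference that file with CircleCI's `{{ checksum "..." }}` template in the cache key, so a single key covers all the hashed files.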
f
okay there's a couple of things here:
1. I think you need to find a way to just let pants be the one to determine whether to use the cache layer or not. It has its own cache miss/fail logic, and it doesn't really expose that afaict. Trying to guess what will lead to pants having a cache miss will probably lead to a ton of subtle errors.
2. Pants wants to treat its cache dirs as normal mutable directories, and then implement its own immutable cache logic on top of that construct. If CircleCI is snapshotting and saving those directories for every run, you're probably going to wind up with a huge amount of waste.
3. Pants has its own pip cache that it keeps in named_caches, but I think the torch wheel should only be downloaded once there. That said, I don't think anyone has a good CI solution for how to deal with an 800 MB library. Pants tends to cache requirements as pexes, which when combined with a big lib like that can lead to a lot of bloat. I think there's been some work done on this, but I'm not sure. Might need to summon @hundreds-father-404 to know more.
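One possible way to limit the waste described in point 2 is to delete a cache directory once it grows past some size, before CI snapshots it, and let pants repopulate it on the next run. The sketch below is a minimal version of that idea; the helper names and the size limit are hypothetical, and which directories to apply it to depends on your setup.

```python
import shutil
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of regular files under path; symlinked files are skipped."""
    return sum(
        p.stat().st_size
        for p in Path(path).rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def prune_if_too_big(path, limit_bytes):
    """Delete path entirely when it exceeds limit_bytes; return True if deleted."""
    path = Path(path)
    if not path.is_dir():
        # Nothing to prune (for example on the first run with a cold cache).
        return False
    if dir_size_bytes(path) <= limit_bytes:
        return False
    shutil.rmtree(path)
    return True
```

For example, calling `prune_if_too_big(Path.home() / ".cache/pants/named_caches", 1024**3)` as a step before the CI cache is saved would cap that directory at roughly 1 GiB (both the path and the limit are assumptions here). The trade-off is a cold cache on the run after a prune.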