# general
r
Hey, so I started setting up Pants in CI, and from reading the docs (https://www.pantsbuild.org/docs/using-pants-in-ci) it's recommended to cache the `$HOME/.cache/pants` folder between steps/builds for faster execution. The catch is that even when running a `./pants` goal on only changed targets (with `--changed-since=origin/main`), Pants starts by downloading all dependencies in the impacted lockfiles, and with a `data-science` resolve that can easily reach ~10 GB of cache to upload/download. Is there any way around this, or am I doing something wrong? ^^ It's actually slower to push/pull 10 GB of cache each time.
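For illustration, roughly what that setup looks like as a CI step, assuming GitHub Actions (the CI system isn't named here; the cache key is illustrative):

```yaml
# Naive approach from the docs: stash the whole Pants cache dir between
# builds. With a large resolve this archives ~10 GB on every run.
- uses: actions/cache@v3
  with:
    path: ~/.cache/pants
    key: pants-${{ runner.os }}-${{ github.sha }}
    restore-keys: pants-${{ runner.os }}-

- name: Test changed targets
  run: ./pants --changed-since=origin/main test
```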
w
stashing portions of the cache is definitely an option. if you keep only `$HOME/.cache/pants/lmdb_store` (for example), you will be able to hit for exact matches, but you won't keep a generic pip cache (under `$HOME/.cache/pants/named_caches`), and will need to re-resolve from scratch after requirements change
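a sketch of what that narrower stash could look like on GitHub Actions (illustrative key; `lmdb_store` hits on exact content digests, so the key mostly controls when a fresh archive is uploaded):

```yaml
# Keep only the content-addressed store. pip's wheel cache under
# named_caches is dropped, so a requirements change re-resolves from scratch.
- uses: actions/cache@v3
  with:
    path: ~/.cache/pants/lmdb_store
    key: pants-lmdb-${{ runner.os }}-${{ github.sha }}
    restore-keys: pants-lmdb-${{ runner.os }}-
```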
but also: pants supports native remote caching, which uploads/downloads precise artifacts, avoiding the need to stash and restore the whole local cache directory in CI
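a minimal sketch of enabling that in `pants.toml` (option names as in recent Pants 2.x releases; the address is a placeholder for whatever REAPI-compatible server you run):

```toml
# pants.toml: read/write process results through a remote REAPI cache.
[GLOBAL]
remote_cache_read = true
remote_cache_write = true
# placeholder address; any REAPI-compatible store works here
remote_store_address = "grpc://cache.example.internal:9092"
```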
r
hmm, but that won't be particularly helpful when running on only changed targets, since they have changed and will need the pip cache to run
w
@rapid-crayon-8232: not quite: if your requirements have not changed, then the resolve will not have changed, and can stay in the cache
you can hit the cache for your third-party requirements, and then that cached resolve will be consumed to actually run the changed targets
p
perhaps use the hash of the requirements file as part of the CI cache key. For example, on GitHub Actions you can use the hashFiles expression: https://docs.github.com/en/actions/learn-github-actions/expressions#hashfiles Other CI systems have equivalent mechanisms.
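for example, keying the pip cache on the requirements file (GitHub Actions syntax; the requirements path is hypothetical):

```yaml
# Re-upload named_caches only when third-party requirements change.
- uses: actions/cache@v3
  with:
    path: ~/.cache/pants/named_caches
    key: pants-named-caches-${{ hashFiles('3rdparty/python/requirements.txt') }}
```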
h
@rapid-crayon-8232 Our company (Toolchain) offers remote caching as a service, if you're interested in trying that out!
r
I had the same question a few days ago when integrating with GitHub Actions. I noticed the cache action was tarring/uploading ~7 GB when I naively cached the entire `lmdb_store` and `named_caches` directories. We use self-hosted runners on Azure Kubernetes, so we ended up mounting a persistent volume in each runner to persist the cache locally between runs. Works well for our specific use case.
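a rough sketch of that pod setup (the names and runner image are hypothetical; the point is just a persistent volume mounted at the Pants cache path):

```yaml
# Abridged runner pod spec: the Pants cache lives on a PVC, so it
# survives between workflow runs instead of being tarred and uploaded.
apiVersion: v1
kind: Pod
metadata:
  name: gha-runner
spec:
  containers:
    - name: runner
      image: my-org/actions-runner:latest  # hypothetical runner image
      volumeMounts:
        - name: pants-cache
          mountPath: /home/runner/.cache/pants
  volumes:
    - name: pants-cache
      persistentVolumeClaim:
        claimName: pants-cache-pvc  # hypothetical claim
```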
h
Are y'all doing the caching action yourself instead of using a tool that's compatible with REAPI, like the docs describe?
h
We implement REAPI
Since that's what it's for 🙂
h
Just to stress how flexible the Pants setup is: we were able to integrate it with an already existing deployment of https://github.com/buchgr/bazel-remote with no changes.
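for anyone wanting to try the same, a sketch of standing up bazel-remote and pointing Pants at it (per the bazel-remote README, gRPC defaults to port 9092; the data path and size cap are illustrative):

```yaml
# docker-compose.yml: a bazel-remote instance exposing its gRPC port.
# Pants then sets remote_store_address = "grpc://localhost:9092".
services:
  bazel-remote:
    image: buchgr/bazel-remote
    command: ["--max_size=50"]  # cache size cap in GiB
    volumes:
      - ./bazel-remote-data:/data
    ports:
      - "9092:9092"  # gRPC (REAPI) endpoint
```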
Sorry @happy-kitchen-89482, my question was meant for @rhythmic-battery-45198 and others 🙂
h
Oh, hah 🙂
r
@rhythmic-battery-45198 we ended up doing the same, since our runners are already self-hosted and it was pretty easy to set up ^^ @witty-crayon-22786 good to know; dependencies indeed don't change very often, so it's a very doable solution. @happy-kitchen-89482 thanks, we're still evaluating how we want to set up our monorepo right now, but it's definitely an option for the future.
r
@high-yak-85899 We don't have an explicit cache step as part of the GitHub Actions workflow; we removed it since the cache directory is persisted on the runner VM's disk. I intend to look into remote caching when I have more time, but went this route as a quick solution.