Hi team, I have a question on Mypy remote caching....
# general
g
Hi team, I have a question on Mypy remote caching. We are using remote caching in CI; however, we don't see remote caching speeding up Mypy. For example, when I start a new CI pipeline for a new branch, it takes 5 mins for pants check to complete. If I restart the same build, it is good to see that pants check finishes quickly (which confirms it is using the remote cache). However, if I add a minor change (for example, adding a dummy test in a test module), pants check takes ~5 mins again in a new CI pipeline. Is this expected?
b
Yeah, expected. The remote cache works at the granularity of a whole process execution, i.e. has Pants already run mypy on this exact set of input files? If so, it just returns the result directly from the remote cache (yay, speedy), because it can assume the previous invocation and this one will give exactly the same result. Mypy has its own internal cache that speeds up runs on similar input files, which is a slightly different situation: the inputs have changed, and the result might be different, but there's still work that can be reused. Pants models this as a "named cache", which is just stored locally and isn't part of the remote cache. I imagine that's partly because it's not as clear which remote cache value to download based on "similar" input files (in a generic, tool-agnostic way). One way to benefit from mypy's internal caching would be to ensure the
~/.cache/pants/named_caches/mypy_cache
directory persists between CI runs, e.g. if using GitHub actions, use the
actions/cache@v3
action with that path (and an appropriately invalidated cache key). https://www.pantsbuild.org/docs/using-pants-in-ci discusses these directories somewhat.
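To make the "exact set of input files" point concrete, here is a minimal sketch of an exact-match cache. It is an illustration only, not Pants' implementation: `process_fingerprint`, `ExactMatchCache`, and `fake_mypy` are hypothetical names, and the real fingerprint covers more than this (e.g. environment variables).

```python
import hashlib


def process_fingerprint(argv, input_files):
    """Hash the command line plus every input file's path and content."""
    h = hashlib.sha256()
    for arg in argv:
        h.update(arg.encode() + b"\0")
    for path in sorted(input_files):
        h.update(path.encode() + b"\0")
        h.update(hashlib.sha256(input_files[path]).digest())
    return h.hexdigest()


class ExactMatchCache:
    """Toy cache: a result is reused only when the fingerprint is identical."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how often the process really ran

    def run(self, argv, input_files, execute):
        key = process_fingerprint(argv, input_files)
        if key not in self._results:
            self.executions += 1
            self._results[key] = execute(argv, input_files)
        return self._results[key]


def fake_mypy(argv, input_files):
    # Stand-in for the expensive type-check; hypothetical output format.
    return f"checked {len(input_files)} files"


cache = ExactMatchCache()
files = {"app.py": b"x: int = 1\n", "test_app.py": b"def test_a(): pass\n"}

cache.run(["mypy"], files, fake_mypy)  # miss: first run executes
cache.run(["mypy"], files, fake_mypy)  # hit: identical inputs, nothing runs
files["test_app.py"] += b"def test_dummy(): pass\n"
cache.run(["mypy"], files, fake_mypy)  # miss: one file changed, whole run repeats
```

The third call misses even though only one test file changed, which matches the ~5 min rerun described above: there is no notion of a "similar" key, only an identical one.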
b
We also have ideas on how to model mypy differently to make caching better. The problem is that, unlike formatters, mypy needs to see all of the files your code transitively depends on. I've also seen that pyright is just faster, if you can stomach switching.
g
thanks for the inputs.
One way to benefit from mypy's internal caching would be ensure the
~/.cache/pants/named_caches/mypy_cache
directory persists between CI runs, e.g. if using GitHub actions, use the
actions/cache@v3
action with that path (and an appropriately invalidated cache key).
We recently switched to remote caching. We are using Kubernetes agents in Jenkins. In the past, we mounted a Kubernetes persistent volume across our builds to share the cache. Later we realized that this isn't the recommended way, since the lmdb store used in Pants only supports local storage, and we saw cache corruption sometimes. Hence, we moved to remote caching. Here are some follow-up questions: 1. Why can't the Pants remote cache store the mypy_cache as well? 2. What exactly is put into the remote cache by Pants? I was imagining Pants would put all those cache folders (e.g. lmdb_store, named_caches, etc.) there, but that doesn't seem to be the case.
b
1. Pants' local/remote cache support requires immutability and reproducibility. The mypy cache (and anything put in a named cache) usually fails one or both of those requirements. Specifically, the mypy cache is both mutable and non-reproducible.
2. The local/remote cache in Pants stores essentially two things:
a. Reproducible process runs. E.g. when Pants runs a process given input files/env vars, it caches the inputs -> outputs mapping.
b. Disk input "fingerprints": for a given file/directory subset, it caches the sha256 sum -> bytes mapping.
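The two kinds of entries described above can be sketched as a pair of small stores. This is a toy model under stated assumptions, not Pants' actual storage code: `ContentStore` and `ProcessCache` are hypothetical names, and the output string is made up.

```python
import hashlib


class ContentStore:
    """Immutable blobs addressed by the sha256 of their bytes (sha256 -> bytes)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


class ProcessCache:
    """Maps (command, input digests) to the digest of the recorded output."""

    def __init__(self, store: ContentStore):
        self._store = store
        self._actions = {}

    @staticmethod
    def _key(argv, input_digests):
        return (tuple(argv), tuple(sorted(input_digests)))

    def record(self, argv, input_digests, output: bytes) -> None:
        self._actions[self._key(argv, input_digests)] = self._store.put(output)

    def lookup(self, argv, input_digests):
        digest = self._actions.get(self._key(argv, input_digests))
        return None if digest is None else self._store.get(digest)


store = ContentStore()
processes = ProcessCache(store)

original = store.put(b"x: int = 1\n")
processes.record(["mypy"], [original], b"no issues found\n")

# Same bytes always map to the same digest, so the entry can be shared safely.
same_again = store.put(b"x: int = 1\n")
hit = processes.lookup(["mypy"], [same_again])

# Changed bytes get a new digest, so nothing recorded earlier applies.
edited = store.put(b"x: int = 2\n")
miss = processes.lookup(["mypy"], [edited])
```

A directory that a tool rewrites in place, like the mypy cache, has no stable digest to file it under, which is one way to see why it does not fit this model.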
g
Thanks for the clarification, I think I got most of it. I am curious about the details on the following
We also have ideas on how to model mypy differently to make caching better.
What is the plan to make it better? Btw, I don't think Pants supports pyright at the moment, am I right?
b
Pants has experimental support for pyright, actually!
We'd change how we run mypy: instead of one big sandbox with everything in it, we'd do incremental runs as we traverse the graph (not dissimilar to how you might envision a compiler run).
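As a rough illustration of the "incremental runs as we traverse the graph" idea, here is a toy sketch. It is not the Pants design or mypy's real behaviour: `incremental_check`, the module names, and the use of a hash as a stand-in for a module's type information are all assumptions made for the example. It needs Python 3.9+ for `graphlib`.

```python
import hashlib
from graphlib import TopologicalSorter


def incremental_check(sources, deps, cache):
    """Check modules in dependency order and return the ones actually re-checked.

    sources: module name -> source bytes
    deps:    module name -> set of modules it imports
    cache:   dict that persists between runs (key -> stand-in result)
    """
    rechecked = []
    results = {}
    for module in TopologicalSorter(deps).static_order():
        # The key covers the module's own source plus its dependencies' results.
        h = hashlib.sha256(sources[module])
        for dep in sorted(deps.get(module, ())):
            h.update(results[dep].encode())
        key = h.hexdigest()
        if key not in cache:
            rechecked.append(module)  # a real system would run the checker here
            cache[key] = key  # stand-in for the module's exported type info
        results[module] = cache[key]
    return rechecked


sources = {
    "lib": b"def f() -> int: return 1\n",
    "app": b"import lib\n",
    "test_app": b"import app\n",
}
deps = {"lib": set(), "app": {"lib"}, "test_app": {"app"}}
cache = {}

first = incremental_check(sources, deps, cache)   # everything is checked once
sources["test_app"] += b"def test_dummy(): pass\n"
second = incremental_check(sources, deps, cache)  # only the edited leaf is re-checked
```

In this sketch an edit to `lib` would still invalidate everything downstream, because the stand-in result is just the key; a real per-module result could stay unchanged when an edit does not affect what the module exports.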