# general
g
Hey team, I have a couple of beginner questions about enhancing the performance of our processes that gather dependees. Would be awesome if someone could help me to understand these questions better. Thanks a lot! 🧡
👀 1
This is our current py-spy performance svg.
Questions:
1. We see `validate_python_dependencies` running at the end, after gathering dependees. Is it possible to skip those processes to save some time? Thanks!
2. For `find_owners`, do we have any recommendations for improving performance? I've seen a long wait before the code executes, and I've seen a couple of threads about this, but any advice would be helpful.
3. Overall, our current process for collecting dependees runs in 5~6 minutes; do we think remote caching could save some time here?
e
1 & 2 together are ~20% of the runtime in your flame graph, so even optimizing them to zero would still leave 4-5 minutes, which isn't much better. I guess the first question is: what Pants command are you running, and what are you trying to achieve with it?
b
For 3, pants 2.17 ships with experimental support for parsing dependencies in Rust. It's actually about as fast as pulling from the old process cache locally, and therefore is faster than looking up in the remote cache. If you can, try it out and report back 🙂
πŸ‘ 1
🫑 1
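For reference, the Rust parser can also be switched on in `pants.toml` rather than per command line; a minimal sketch, assuming the option names from the Pants 2.17 docs (double-check against your version):

```toml
# pants.toml -- sketch; verify option names against your Pants version
[python-infer]
# Experimental in Pants 2.17: parse Python imports with the built-in Rust
# parser instead of spawning Python parsing processes.
use_rust_parser = true
```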
g
@bitter-ability-32190 Thanks for the advice. Could you give me some advice on applying `parse_python_dependencies` to our use case? We build a dependency graph for our repo, then select the tests that need to run in our CI system based on the specs input. I'm having some difficulty connecting our current subsystem with the `PythonInferSubsystem`. Thanks a lot, any advice would be helpful!
b
I think I need more context, but I'm not really sure what you mean 😅
g
> For 3, pants 2.17 ships with experimental support for parsing dependencies in Rust. It's actually about as fast as pulling from the old process cache locally, and therefore is faster than looking up in the remote cache. If you can, try it out and report back
Sorry about that, just want to follow up on this comment: I did some research on how to use the Rust dependency parser, and it would be great if you could point me in the right direction for using it.
b
g
```shell
time ./pants dependents app/xxx --stats-log --python-infer-use-rust-parser
./pants dependents app/xxx --stats-log --python-infer-use-rust-parser   0.49s user 0.20s system 0% cpu 3:48.40 total
```
My current observations are:
1. `--python-infer-use-rust-parser` does not improve performance for `./pants dependents` or the `(MultiGet/Get dependencies)` work.
2. We use a local cache, but it does not help reruns of `./pants dependents` commands; we always spend 5+ minutes building the dependency graph.

Attached is the latest py-spy svg.
b
Do you have custom dependency inference rules? How many files do you have?
g
We have 50000+ files and yes we have our own custom dependency inference rules.
b
OK, then that timing doesn't surprise me 😅
(although I haven't seen the py-spy)
g
Just curious: if I run `./pants dependents xxx` twice, should the results from the first run be cached to speed up the second run? I'm wondering whether we could utilize more caches in our CI builds.
b
So there's the daemon, which makes things go super speedy fast (in-process caching). If you exceed the memory limit, or a few other things, the daemon restarts. Then there's the process cache. If Pants has run a particular process in the past it'll pull from this. Then the rust parser uses a separate on-disk cache.
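The three cache layers described above map to configuration roughly like this; a sketch, assuming standard option names from the Pants docs (defaults may differ across versions):

```toml
# pants.toml -- illustrative, not a drop-in config
[GLOBAL]
# Layer 1: the daemon keeps the build graph in memory between runs
# (restarts if it exceeds its memory limit, losing the warm graph).
pantsd = true

# Layer 2: the on-disk process cache that previously-run processes are
# pulled from.
local_store_dir = "~/.cache/pants/lmdb_store"

[python-infer]
# Layer 3: the Rust parser maintains its own separate on-disk cache.
use_rust_parser = true
```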
g
After bumping the pantsd cache from 2GiB to 12GiB, the duration went down from 240 secs to 48 secs. 🔥🔥🔥🔥🔥🔥🔥🔥
A follow-up question: how do we evaluate the right pantsd cache size for running in CI? We don't have a concrete number at the moment and would like to understand how to find one.
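One way to approach sizing, sketched under the assumption that `pantsd_max_memory_usage` is the option in play (recent Pants versions accept human-readable sizes; older ones take raw bytes): watch pantsd's resident memory during a representative CI run, then set the limit comfortably above the observed peak so the daemon doesn't restart mid-pipeline and drop its warm in-memory graph.

```toml
# pants.toml -- sketch; the right number depends on repo size and workload
[GLOBAL]
# If pantsd exceeds this limit, it restarts and the in-memory graph cache
# is lost, so size it above the peak observed on a representative run,
# plus headroom.
pantsd_max_memory_usage = "12GiB"
```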