# general
f
I wanted to compare the performance of Pants dependency inference to some naive ways of doing it myself, and I'm getting noticeably better performance with a single-threaded AST scanning method. I know this method doesn't have anywhere near as many features as Pants does, but I want to better understand why my silly script (12-15 sec) performs so much better than Pants (30-45 sec) for the same codebase and queries. I assume it's something involving cache, IO, or task scheduler overhead, but I'd like to hear y'all's input. More details in 🧵
How are you measuring the Pants #?
p
keep in mind that pants also looks at string literals, not just import statements (this is useful with Django for example, where strings are used in settings files to reference various components)
h
keep in mind that pants also looks at string literals
Only if [python-infer].strings is true
f
to measure pants perf, I'm nuking the lmdb_store, running ./pants dependees ... and timing that
h
Ah re dependees, I strongly suspect https://github.com/pantsbuild/pants/issues/11270 is in play
f
Looking at this code, the big difference is that Pants is using a Get for every little thing, I would suspect there's some resource contention at the scheduler level
✅ 1
b
I've thought about batching dependency inference, but never surfaced the thought here
Feel free to use py-spy (https://github.com/benfred/py-spy) to measure things. It's an awesome tool
h
I assume it's something involving cache, IO, or task scheduler overhead,
One source of overhead is determining the Python interpreter to parse the code with, which can take 0.25-2 seconds I believe. That does get memoized w/ pantsd, but that doesn't help on a cold run. Your script instead uses the interpreter it's executed with
f
if it were 1-2 sec difference i wouldn't worry about it
(and I'm not really worried about this either, I just need to have a good story around it since I'm going to get questions about this in a NIH environment)
👍 1
b
Just run Pants and your script 3 times, then they're even (cuz' caching)
h
Josh, there is also overhead in launching a distinct process per file to parse imports, vs doing it in-memory. But the benefit is caching to disk, and we need to do it that way because of interpreter selection. We could batch that more, but then we lose fine-grained caching. Any change at all to the file, even whitespace, invalidates the cache, so I think fine-grained caching is pretty crucial. Altho it could be worth experimenting w/ batching
f
I think the caching makes sense
also I just realized that script wasn't doing transitive dependees either
👀 1
fixed that tidbit with networkx, it had a negligible impact on perf
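For anyone following along, here is a minimal sketch of what a naive single-threaded AST scan with transitive dependees could look like. It is not the script discussed above (that one used networkx) and it is not how Pants does inference; the function names are made up for illustration, the graph walk is a plain stdlib BFS, and it assumes one module per source string, ignoring relative imports and string literals.

```python
from __future__ import annotations

import ast
from collections import defaultdict, deque


def imported_modules(source: str) -> set[str]:
    """Return the absolute module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this sketch.
            modules.add(node.module)
    return modules


def build_dependees(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each first-party module to the modules that import it directly.

    `sources` maps a (hypothetical) module name to its source text.
    """
    dependees = defaultdict(set)
    for module, source in sources.items():
        for imported in imported_modules(source):
            if imported in sources:  # ignore stdlib and third-party imports
                dependees[imported].add(module)
    return dependees


def transitive_dependees(dependees: dict[str, set[str]], root: str) -> set[str]:
    """Breadth-first walk over the reverse import graph starting at `root`."""
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dependee in dependees.get(current, ()):
            if dependee not in seen:
                seen.add(dependee)
                queue.append(dependee)
    return seen
```

Everything here happens in one process with no caching and no interpreter selection, which is a big part of why something like this starts fast on a cold run.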
w
things like https://github.com/pantsbuild/pants/issues/13112 should hopefully make a material difference here. we're definitely working to push down our overhead.
🙏🏻 1
b
I think similar to lint or fmt, if we batched smartly it might be a win. It should be easy to come up with numbers
h
Yeah, I'd guess the big difference here is the overhead of launching a process per file to parse
The real win is to batch multiple files in a single process, but then split out the results and cache them separately. This would require changes to the engine, though, to enable this splitting.
➕ 3
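A rough sketch of the "batch, then split and cache separately" idea, just to make it concrete. This is not the Pants engine or its API: `infer_batch`, the in-memory dict cache, and the content-digest key are all assumptions for illustration, and the loop over cache misses stands in for a single batched process invocation.

```python
from __future__ import annotations

import ast
import hashlib


def parse_imports(source: str) -> frozenset[str]:
    """Collect absolute imports from one file's source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return frozenset(found)


def infer_batch(
    files: dict[str, str], cache: dict[str, frozenset[str]]
) -> tuple[dict[str, frozenset[str]], list[str]]:
    """Parse all uncached files in one batch, caching each result by content digest.

    Returns the per-file imports plus the list of paths that actually had to be
    parsed, so the effect of the fine-grained cache is visible.
    """
    digests = {
        path: hashlib.sha256(source.encode()).hexdigest()
        for path, source in files.items()
    }
    misses = [path for path in files if digests[path] not in cache]
    # One pass over all misses: a stand-in for one process handling the batch.
    for path in misses:
        cache[digests[path]] = parse_imports(files[path])
    # Split the batch back out into per-file results.
    return {path: cache[digests[path]] for path in files}, misses
```

With this shape, editing one file (even just whitespace) changes only that file's digest, so the next batch reparses that file alone while the rest stay cached.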
b
Yeah that'd benefit others like fmt and lint as well