
flat-zoo-31952

03/07/2022, 6:44 PM
I wanted to compare the performance of Pants dependency inference to some naive ways of doing it myself, and I'm getting noticeably better performance with a single-threaded AST scanning method. I know this method doesn't have anywhere near the number of features that Pants does, but I want to better understand why my silly script (12-15 sec) performs so much better than Pants (30-45 sec) for the same codebase and queries? I assume it's something involving cache, IO, or task scheduler overhead, but I'd like to hear y'all's input. More details in 🧡
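A minimal sketch of the kind of single-threaded AST scan being described, assuming the script just walks the repo and collects import statements (names and structure are illustrative, not the actual script):

```python
# Naive single-threaded import scanner: one process, one pass, ast only.
import ast
from pathlib import Path

def scan_imports(root: str) -> dict[str, set[str]]:
    deps: dict[str, set[str]] = {}
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        modules: set[str] = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.add(node.module)
        deps[str(path)] = modules
    return deps
```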
How are you measuring the Pants #?

polite-garden-50641

03/07/2022, 6:48 PM
keep in mind that pants also looks at string literals, not just import statements (this is useful with Django, for example, where strings are used in settings files to reference various components)

hundreds-father-404

03/07/2022, 6:49 PM
keep in mind that pants also looks at strings literals
Only if [python-infer].strings is true
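As a concrete illustration of what string-based inference catches, consider a hypothetical Django settings file (module names are made up):

```python
# settings.py: the middleware is referenced by a string, not an import.
# An import-statement-only scanner never sees this edge; string-based
# inference can still map "myapp.middleware" to a dependency.
MIDDLEWARE = [
    "myapp.middleware.TimingMiddleware",
]
```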

flat-zoo-31952

03/07/2022, 6:49 PM
to measure pants perf, I'm nuking the lmdb_store and running ./pants dependees ... and timing that
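That measurement could be scripted roughly like this, assuming the local store lives at the default ~/.cache/pants/lmdb_store (the store path and the target spec are assumptions):

```python
# Time a cold Pants run by clearing the local store first.
import shutil
import subprocess
import time
from pathlib import Path

store = Path.home() / ".cache" / "pants" / "lmdb_store"
shutil.rmtree(store, ignore_errors=True)  # force a cold cache

start = time.perf_counter()
subprocess.run(["./pants", "dependees", "::"], check=True)
print(f"cold run: {time.perf_counter() - start:.1f}s")
```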

hundreds-father-404

03/07/2022, 6:50 PM
Ah, re dependees, I strongly suspect https://github.com/pantsbuild/pants/issues/11270 is in play

flat-zoo-31952

03/07/2022, 6:51 PM
Looking at this code, the big difference is that Pants is using a Get for every little thing; I would suspect there's some resource contention at the scheduler level
βœ… 1
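The scheduler-overhead intuition can be illustrated outside of Pants with a toy asyncio comparison (an analogy only, not the actual engine code):

```python
# Toy illustration of per-item vs. batched scheduling overhead.
# Awaiting one tiny task per file pays scheduler bookkeeping N times;
# doing the same work in one batch pays it once.
import asyncio
import time

async def parse_one(i: int) -> int:
    return i  # stand-in for a tiny unit of work

async def main() -> None:
    n = 100_000

    start = time.perf_counter()
    await asyncio.gather(*(parse_one(i) for i in range(n)))
    print(f"per-item: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    results = [i for i in range(n)]  # same work, one "batch"
    print(f"batched:  {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```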

bitter-ability-32190

03/07/2022, 6:51 PM
I've thought about batching dependency inference, but never surfaced the thought here
Feel free to use py-spy (https://github.com/benfred/py-spy) to measure things. It's an awesome tool

hundreds-father-404

03/07/2022, 6:52 PM
I assume it's something involving cache, IO, or task scheduler overhead
One source of overhead is determining the Python interpreter to parse the code with, which can be 0.25-2 seconds I believe. That does get memoized w/ pantsd, but that doesn't help on a cold run. Your script, by contrast, just uses the interpreter it's executed with
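A rough illustration of where that cold-run cost comes from: discovering interpreters means filesystem probes plus a subprocess per candidate, whereas the standalone script just uses sys.executable (candidate names here are illustrative):

```python
# Why interpreter discovery costs time on a cold run: each candidate
# binary is located on PATH and then queried in a fresh subprocess.
import shutil
import subprocess

for name in ("python3.9", "python3.8", "python3.7", "python3"):
    path = shutil.which(name)
    if path:
        version = subprocess.run(
            [path, "-c", "import sys; print(sys.version.split()[0])"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(f"{name}: {path} ({version})")
```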

flat-zoo-31952

03/07/2022, 6:53 PM
if it were a 1-2 sec difference I wouldn't worry about it
(and I'm not really worried about this either; I just need to have a good story around it, since I'm going to get questions about this in a NIH environment)
πŸ‘ 1

bitter-ability-32190

03/07/2022, 6:54 PM
Just run Pants and your script 3 times, then they're even ('cuz caching)

hundreds-father-404

03/07/2022, 6:56 PM
Josh, there is also overhead in launching a distinct process per file to parse imports, vs doing it in-memory. But the benefit is caching to disk. And we need to do it that way because of interpreter selection. We could batch that more, but then we lose fine-grained caching. Any change at all to the file, even whitespace, invalidates the cache, so I think a fine-grained cache is pretty crucial. Although it could be worth experimenting w/ batching
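A sketch of the fine-grained vs. batched caching tradeoff being described (the hash scheme and file names are illustrative):

```python
# Per-file cache keys survive edits to unrelated files; a single
# batch-level key is invalidated by any edit, even whitespace.
import hashlib
from pathlib import Path

def file_key(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

files = ["app/a.py", "app/b.py", "app/c.py"]  # hypothetical sources

per_file_keys = {p: file_key(p) for p in files}  # fine-grained: N entries
batch_key = hashlib.sha256(                       # coarse-grained: one entry
    "".join(sorted(per_file_keys.values())).encode()
).hexdigest()

# Touch app/a.py: only per_file_keys["app/a.py"] changes, but batch_key
# changes too, discarding the cached results for b.py and c.py.
```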

flat-zoo-31952

03/07/2022, 6:57 PM
I think the caching makes sense
also I just realized that script wasn't doing transitive dependees either
πŸ‘€ 1
fixed that tidbit with networkx; it had a negligible impact on perf
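For reference, transitive dependees fall out of networkx almost for free, assuming edges point from a file to the files it imports (file names are made up):

```python
import networkx as nx

# Edge u -> v means "u imports v".
g = nx.DiGraph()
g.add_edges_from([("b.py", "a.py"), ("c.py", "b.py")])

# Transitive dependees of a.py = every node with a path *to* it.
print(nx.ancestors(g, "a.py"))  # {'b.py', 'c.py'}
```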

witty-crayon-22786

03/07/2022, 7:14 PM
things like https://github.com/pantsbuild/pants/issues/13112 should hopefully make a material difference here. we’re definitely working to push down our overhead.
πŸ™πŸ» 1

bitter-ability-32190

03/07/2022, 7:15 PM
I think, similar to lint or fmt, if we batched smartly it might be a win. It should be easy to come up with numbers

happy-kitchen-89482

03/07/2022, 7:30 PM
Yeah, I'd guess the big difference here is the overhead of launching a process per file to parse
The real win is to batch multiple files in a single process, but then split out the results and cache them separately. This would require changes to the engine though.
βž• 3
To enable this splitting
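A hedged sketch of that batch-then-split idea (the function shape is an assumption, nothing like the actual engine internals): parse many files in one process, but key each file's result by its own content hash.

```python
# One subprocess parses the whole batch; results are still cached per
# file, keyed by each file's own content hash.
import ast
import hashlib
from pathlib import Path

def parse_batch(paths: list[str]) -> dict[str, list[str]]:
    cache: dict[str, list[str]] = {}
    for p in paths:
        source = Path(p).read_bytes()
        key = hashlib.sha256(source).hexdigest()  # per-file cache key
        modules: list[str] = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules.extend(a.name for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules.append(node.module)
        cache[key] = modules  # computed per batch, cached per file
    return cache
```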

bitter-ability-32190

03/07/2022, 7:33 PM
Yeah that'd benefit others like fmt and lint as well