I wanted to compare the performance of Pants dependency infe Pants #general


flat-zoo-31952

03/07/2022, 6:44 PM

I wanted to compare the performance of Pants dependency inference against some naive ways of doing it myself, and I'm getting considerably better performance with a single-threaded AST scanning method. I know this method doesn't have anywhere near the number of features that Pants does, but I want to better understand why my silly script (12-15 sec) performs so much better than Pants (30-45 sec) for the same codebase and queries. I assume it's something involving cache, IO, or task scheduler overhead, but I'd like to hear y'all's input. More details in 🧵

flat-zoo-31952

03/07/2022, 6:44 PM

My script is here: https://gist.github.com/jriddy/8840839d2a3506eb6787683535a8e1b8
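For readers without access to the gist, here is a minimal sketch of what single-threaded AST import scanning can look like. It is not the linked script: the function name `imported_modules` and the choice to skip relative imports are assumptions made for illustration.

```python
import ast
from typing import Set


def imported_modules(source: str) -> Set[str]:
    """Return the absolute module names imported by a piece of Python source."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes "a.b"
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this sketch.
            modules.add(node.module)
    return modules


example = "import os\nfrom collections import abc\nfrom . import sibling\n"
print(sorted(imported_modules(example)))
```

Everything happens in one process with the interpreter that runs the script, which is the main structural difference from the per-file approach discussed later in the thread.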

bitter-ability-32190

03/07/2022, 6:47 PM

Pants does just use `ast` to parse the inputs. The Pants rule code is here: https://github.com/pantsbuild/pants/blob/main/src/python/pants/backend/python/dependency_inference/parse_python_imports.py and the parser is here: https://github.com/pantsbuild/pants/blob/main/src/python/pants/backend/python/dependency_inference/scripts/import_parser.py

bitter-ability-32190

03/07/2022, 6:48 PM

How are you measuring the Pants #?

polite-garden-50641

03/07/2022, 6:48 PM

keep in mind that Pants also looks at string literals, not just import statements (this is useful with Django, for example, where strings are used in settings files to reference various components)

polite-garden-50641

03/07/2022, 6:49 PM

https://github.com/pantsbuild/pants/blob/94921a8edb9b142dff471784aef9da22b58cd0ab/[…]ts/backend/python/dependency_inference/scripts/import_parser.py

hundreds-father-404

03/07/2022, 6:49 PM

keep in mind that pants also looks at string literals

Only if `[python-infer].strings` is true
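A rough sketch of the string-literal idea mentioned above: walk the AST and collect string constants that look like dotted module paths. The regex `MODULE_LIKE` and the function name are hypothetical; this is not the pattern Pants uses, and in Pants the behavior is gated by the option mentioned above.

```python
import ast
import re

# Hypothetical pattern: identifiers joined by at least one dot, e.g. "pkg.mod".
MODULE_LIKE = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)+$")


def module_like_strings(source):
    """Return string literals in `source` that look like dotted module paths."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        # ast.Constant covers all literals; keep only strings that match.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if MODULE_LIKE.match(node.value):
                found.add(node.value)
    return found


settings = 'INSTALLED_APPS = ["django.contrib.admin", "not a module"]\n'
print(module_like_strings(settings))
```

A scanner that only looks at `import` statements would miss the Django-style reference in `settings`, which is one reason the two approaches are not doing the same amount of work.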

flat-zoo-31952

03/07/2022, 6:49 PM

to measure Pants perf, I'm nuking the lmdb_store, running `./pants dependees ...`, and timing that

hundreds-father-404

03/07/2022, 6:50 PM

Ah re `dependees`, I strongly suspect https://github.com/pantsbuild/pants/issues/11270 is in play

flat-zoo-31952

03/07/2022, 6:51 PM

Looking at this code, the big difference is that Pants is using a `Get` for every little thing; I would suspect there's some resource contention at the scheduler level

✅ 1

bitter-ability-32190

03/07/2022, 6:51 PM

I've thought about batching dependency inference, but never surfaced the thought here

bitter-ability-32190

03/07/2022, 6:52 PM

Feel free to use `py-spy` (https://github.com/benfred/py-spy) to measure things. It's an awesome tool

hundreds-father-404

03/07/2022, 6:52 PM

I assume it's something involving cache, IO, or task scheduler overhead,

One source of overhead is determining the Python interpreter to parse the code with, which can be 0.25-2 seconds I believe. That does get memoized w/ pantsd, but that's not helpful on a cold run. Your script, in contrast, uses the interpreter it's executed with

flat-zoo-31952

03/07/2022, 6:53 PM

if it were 1-2 sec difference i wouldn't worry about it

flat-zoo-31952

03/07/2022, 6:53 PM

(and I'm not really worried about this either, I just need to have a good story around it since I'm going to get questions about this in a NIH environment)

👍 1

bitter-ability-32190

03/07/2022, 6:54 PM

Just run Pants and your script 3 times, then they're even (cuz' caching)

hundreds-father-404

03/07/2022, 6:56 PM

Josh, there is also overhead in launching a distinct process per file to parse imports, vs doing it in-memory. But the benefit is caching to disk. And we need to do it that way because of interpreter selection. We could batch that more, but then we lose fine-grained caching. Any change at all to the file, even whitespace, invalidates the cache, so I think a fine-grained cache is pretty crucial. Altho it could be worth experimenting w/ batching

flat-zoo-31952

03/07/2022, 6:57 PM

I think the caching makes sense

flat-zoo-31952

03/07/2022, 6:58 PM

also I just realized that script wasn't doing transitive dependees either

👀 1

flat-zoo-31952

03/07/2022, 7:08 PM

fixed that tidbit with networkx, had a negligible impact on perf
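To illustrate what "transitive dependees" means here, a small sketch follows. It swaps networkx for a plain breadth-first search so it stays standard-library only; the `deps` mapping and the function name are assumptions, not taken from the gist.

```python
from collections import defaultdict, deque


def transitive_dependees(deps, target):
    """Return every module that depends on `target`, directly or indirectly.

    `deps` maps each module to the set of modules it imports directly.
    """
    # Reverse the edges: imported module -> modules that import it.
    reverse = defaultdict(set)
    for module, imported in deps.items():
        for dep in imported:
            reverse[dep].add(module)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependee in reverse[current]:
            if dependee not in seen:
                seen.add(dependee)
                queue.append(dependee)
    seen.discard(target)  # only relevant if the graph has a cycle
    return seen


deps = {"app": {"lib"}, "lib": {"util"}, "cli": {"app"}, "other": set()}
print(sorted(transitive_dependees(deps, "util")))
```

With networkx, a graph whose edges point from importer to imported module should give the same set via `networkx.ancestors(graph, target)`.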

witty-crayon-22786

03/07/2022, 7:14 PM

things like https://github.com/pantsbuild/pants/issues/13112 should hopefully make a material difference here. we’re definitely working to push down our overhead.

🙏🏻 1

bitter-ability-32190

03/07/2022, 7:15 PM

I think, similar to lint or fmt, if we batched smartly it might be a win. It should be easy to come up with numbers

happy-kitchen-89482

03/07/2022, 7:30 PM

Yeah, I'd guess the big difference here is the overhead of launching a process per file to parse

happy-kitchen-89482

03/07/2022, 7:30 PM

The real win is to batch multiple files in a single process, but then split out the results and cache them separately. This would require changes to the engine though.

➕ 3

happy-kitchen-89482

03/07/2022, 7:31 PM

To enable this splitting
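The batching idea above (parse many files in one go, but cache each file's result separately) can be sketched in memory as follows. This is only an analogy under stated assumptions: it is not the Pants engine, the cache is a plain dict keyed by a SHA-256 of the file content, and `batch_parse` and `parse_imports` are hypothetical names.

```python
import ast
import hashlib


def parse_imports(source):
    """Collect imported module names from one file's source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return frozenset(modules)


def batch_parse(sources, cache):
    """Parse all cache misses in one batch; store results per file.

    sources: path -> source text; cache: content digest -> imports.
    Returns (results by path, list of paths that were actually parsed).
    """
    digests = {
        path: hashlib.sha256(text.encode()).hexdigest()
        for path, text in sources.items()
    }
    misses = [path for path, digest in digests.items() if digest not in cache]
    for path in misses:  # a single batch handled by the same process
        cache[digests[path]] = parse_imports(sources[path])
    results = {path: cache[digest] for path, digest in digests.items()}
    return results, misses


cache = {}
sources = {"a.py": "import os\n", "b.py": "import sys\n"}
_, parsed_first = batch_parse(sources, cache)
sources["a.py"] = "import os\n\n"  # whitespace-only change
_, parsed_second = batch_parse(sources, cache)
print(parsed_first, parsed_second)
```

Because the cache key is the file content, even a whitespace-only edit to `a.py` causes a re-parse of that one file, while `b.py` is still served from the cache. That is the fine-grained behavior the thread wants to keep while reducing per-process overhead.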

bitter-ability-32190

03/07/2022, 7:33 PM

Yeah that'd benefit others like fmt and lint as well
