
bitter-ability-32190

04/28/2023, 7:30 PM
Also, metrics from "intrinsic parsing" are in https://github.com/pantsbuild/pants/pull/18854. TL;DR:
• In the worst case, performance of dependency inference in the Pants repo went from 72.388 s ± 10.145 s down to 15.385 s ± 0.150 s. Larger repos = larger rewards.
• In the best case (super hot cache), performance didn't change.
  ◦ We're not consulting the cache anymore; we're just always re-parsing.
  ◦ Put another way, reparsing the files is just as fast as computing cache keys and pulling from the cache.
🏎️ 2
🙌 1
🚀 3
CC @witty-crayon-22786 and @happy-kitchen-89482 This was a fun thing to get working on at PyCon ❤️
I wonder if we could do the same thing using tree-sitter-generated Rust code for any number of languages
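A minimal sketch of that idea, assuming the tree-sitter and tree-sitter-python crates (0.20-era API; not code from the PR): parse a source string with the generated Python grammar and print the top-level import statements. Swapping in a different grammar crate is what would generalize this to other languages.

```rust
use tree_sitter::Parser;

fn main() {
    let source = "import os\nfrom foo.bar import baz\n";

    let mut parser = Parser::new();
    // The grammar is generated code compiled into the binary and exposed
    // through the tree-sitter-python crate (0.20-era API assumed here).
    parser
        .set_language(tree_sitter_python::language())
        .expect("the Python grammar should load");
    let tree = parser.parse(source, None).expect("parsing should produce a tree");

    let root = tree.root_node();
    let mut cursor = root.walk();
    for node in root.children(&mut cursor) {
        // Only top-level statements are inspected here; a real visitor would
        // recurse to find imports nested inside functions, try blocks, etc.
        if node.kind() == "import_statement" || node.kind() == "import_from_statement" {
            println!("{}", node.utf8_text(source.as_bytes()).unwrap());
        }
    }
}
```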

witty-crayon-22786

04/28/2023, 7:41 PM
very nice 😃
relatedly: i don’t think i’ve ever done a benchmark of “maximally tiny” process spawning… would be interesting to see what our overhead is in that case, for if we ever do want to put it out-of-process
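For what it's worth, a "maximally tiny" spawn benchmark can be sketched in a few lines (illustrative only, not an existing Pants benchmark; assumes a Unix-like system with `true` on the PATH): spawn a do-nothing process N times and divide the wall-clock time by N.

```rust
use std::process::Command;
use std::time::Instant;

fn main() {
    // Spawn a do-nothing process repeatedly to estimate per-process overhead
    // (fork/exec plus process teardown, with no interpreter startup).
    let n: u32 = 200;
    let start = Instant::now();
    for _ in 0..n {
        let status = Command::new("true").status().expect("`true` should spawn");
        assert!(status.success());
    }
    println!("mean cost per spawn: {:?}", start.elapsed() / n);
}
```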

bitter-ability-32190

04/28/2023, 7:45 PM
Yeah. I plan on putting a pretty bow on this in a different crate, and exposing it as a CLI with a tiny Python shim. It's a useful utility outside of Pants.

happy-kitchen-89482

04/28/2023, 9:12 PM
Neat!!
We probably still want to cache, for large repos?

bitter-ability-32190

04/28/2023, 9:27 PM
We don't have to. That's the beautiful thing. The cache is just as slow as re-parsing, meaning we're only ever strictly faster. But that's not to say we couldn't sprinkle in a little local caching to make it EVEN FASTER
I should point out, the result is still memoized.
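A hypothetical sketch of what "sprinkling in a little local caching" could look like (not the Pants implementation, which memoizes at the rule level): key the extracted imports on a hash of the file contents so an unchanged file skips re-parsing. The `parse` closure is a toy stand-in for the real AST-based extractor.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical local cache: memoize extracted imports by a hash of the
/// file contents, so unchanged files skip re-parsing.
struct ImportCache {
    by_digest: HashMap<u64, Vec<String>>,
}

impl ImportCache {
    fn new() -> Self {
        Self { by_digest: HashMap::new() }
    }

    fn imports_for(&mut self, source: &str, parse: impl Fn(&str) -> Vec<String>) -> Vec<String> {
        let mut hasher = DefaultHasher::new();
        source.hash(&mut hasher);
        let digest = hasher.finish();
        // Parse only on a digest miss; otherwise return the cached result.
        self.by_digest
            .entry(digest)
            .or_insert_with(|| parse(source))
            .clone()
    }
}

fn main() {
    // Toy "parser": collects `import x` lines; the real one walks an AST.
    let parse = |src: &str| -> Vec<String> {
        src.lines()
            .filter_map(|l| l.strip_prefix("import "))
            .map(str::to_string)
            .collect()
    };

    let mut cache = ImportCache::new();
    let first = cache.imports_for("import foo\n", &parse);
    let second = cache.imports_for("import foo\n", &parse); // digest hit: no re-parse
    assert_eq!(first, second);
}
```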

broad-processor-92400

04/28/2023, 9:44 PM
Would caching still be useful with a larger batch size? (i.e. not paying for the overhead per file?)

bitter-ability-32190

04/28/2023, 9:47 PM
Ummmm put another way?
This way doesn't have any overhead, realistically, for the number of files, because this happens on rule threads

witty-crayon-22786

04/28/2023, 10:22 PM
> The cache is just as slow as re-parsing.
or close enough as to not matter, probably… but can confirm by looking at the workunits

bitter-ability-32190

04/28/2023, 10:27 PM
Yeah it'd be great to back that claim up with hard(er) numbers. I think sprinkling in the local cache (that the URL verifier uses) would make this an all-around win

happy-kitchen-89482

04/29/2023, 2:23 AM
Yeah, I guess the next perf frontier for import extraction would be running in batches

bitter-ability-32190

04/29/2023, 9:28 AM
To be crystal clear here (because I'm getting a little confused): this thing does no caching whatsoever. It also doesn't run any processes. In rule threads in the Pants process, we just rip through Python files at lightning speed. So there is nothing to batch, at least for this implementation. If you're talking about other languages, then yeah, maybe. Or just do more of this, because the numbers are AMAZING
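To illustrate why batching buys nothing in this model (a sketch, not Pants code): when parsing runs on in-process threads, the only per-file cost is the parse itself, so there is no spawn or startup overhead to amortize across a batch. `parse_imports` below is a toy stand-in for the real parser.

```rust
use std::thread;

// Toy stand-in for the real AST-based parser.
fn parse_imports(source: &str) -> Vec<String> {
    source
        .lines()
        .filter_map(|l| l.strip_prefix("import "))
        .map(str::to_string)
        .collect()
}

fn main() {
    let files = ["import a\n", "import b\nimport c\n", "x = 1\n"];

    // Parse every file on in-process threads: no per-file process spawn or
    // interpreter startup, so there is nothing for a batch to amortize.
    let results: Vec<Vec<String>> = thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|src| s.spawn(move || parse_imports(src)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    println!("{results:?}");
}
```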

happy-kitchen-89482

04/29/2023, 1:48 PM
So now I'm confused too - this is an engine intrinsic? (I should look at the code...)

bitter-ability-32190

04/29/2023, 1:49 PM
Yup. (You might want to reread the first sentence of the original post 😅)

happy-kitchen-89482

04/29/2023, 1:52 PM
Hmm, very cool, but does this not violate the plugin boundary? We now have very Python-specific code in the engine itself
So it's the first time we're letting a standard plugin do something a custom plugin author can't do

bitter-ability-32190

04/29/2023, 1:55 PM
That's not technically true. Any plugin could embed a binary and do C calls into it instead of running a process. The reason it's OK-enough in the Pants engine is because:
1. We already have a conveniently embedded binary.
2. We were already going to add the Python parsing library for the sake of parsing rule code in the engine (faster startup time), so adding this intrinsic is negligible in size.
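As a sketch of the "embed a binary and do C calls into it" alternative (everything here is invented for illustration): a plugin could ship a Rust cdylib exposing a C-ABI function and load it from Python, e.g. via ctypes, instead of spawning a process.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical C-ABI entry point a plugin-shipped cdylib could expose.
/// The name, signature, and trivial line-based logic are all illustrative.
#[no_mangle]
pub extern "C" fn count_import_lines(source: *const c_char) -> i64 {
    // Safety: the caller must pass a valid, NUL-terminated string.
    let source = unsafe { CStr::from_ptr(source) };
    let Ok(source) = source.to_str() else { return -1 };
    source
        .lines()
        .filter(|line| {
            let line = line.trim_start();
            line.starts_with("import ") || line.starts_with("from ")
        })
        .count() as i64
}
```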

enough-analyst-54434

04/29/2023, 3:11 PM
FWICT the important thing to keep track of is the latency / reliability of Python upgrade support. It looks like 3.12 only adds one new bit of supported syntax, for bindings in comprehensions, but presumably that would be an issue here for Pants supporting Python 3.12 in October. Since ruff uses this and is popular, it may be that this is all fine and upgrades will happen promptly.

bitter-ability-32190

04/29/2023, 3:14 PM
The parser is autogenerated, for the most part 😉

enough-analyst-54434

04/29/2023, 3:17 PM
Sure, but I don't believe that fully answers the concern. At any rate, taking a broader view, I too am a bit concerned that we're having to hack around fundamental slowness in the infra. This is probably a good win in the short to medium term - it's invisible and reversible without disrupting users if need be - but the latency issues with the model are troubling.

witty-crayon-22786

04/29/2023, 3:36 PM
The alternative "infrastructure" in this case is a bare invocation of a Python interpreter. Yeah, Python should be faster. And yes, we could hypothetically use nailgun with Python. And batching has plenty of challenges and complexity. But there also comes a point where it is simpler to use a new language. I'm in favor of doing that in this case. If people have an issue with it being an intrinsic, then it could be an external process instead, but I don't personally think that's necessary, for the same reason we don't have swappable released versions of the dep-extraction logic currently.

bitter-ability-32190

04/29/2023, 6:20 PM
It also doesn't have to be a process. We could just call a C API as well from rule code. That's my point: we're just putting it where it's convenient, because we're already shipping a binary which has (or was about to have) a Python AST visitor.
(although to be completely fair, the intrinsic gets free memoization)
Oh wow. So I'm also considering tree-sitter, and although the interface is kinda hard to grok (using string queries in its own query language), it would be able to support Python 3/Python 2/etc...
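Those string queries look roughly like this sketch (same 0.20-era tree-sitter crates assumed as above; the node kinds and field names come from the tree-sitter-python grammar): the query is written in tree-sitter's S-expression query language and captures dotted module names from both import forms.

```rust
use tree_sitter::{Parser, Query, QueryCursor};

fn main() {
    let source = "import os\nfrom foo.bar import baz\n";

    let mut parser = Parser::new();
    parser
        .set_language(tree_sitter_python::language())
        .expect("the Python grammar should load");
    let tree = parser.parse(source, None).expect("parsing should produce a tree");

    // S-expression query: capture the dotted module name of `import x.y`
    // and the module part of `from x.y import z`.
    let query = Query::new(
        tree_sitter_python::language(),
        r#"
        (import_statement name: (dotted_name) @module)
        (import_from_statement module_name: (dotted_name) @module)
        "#,
    )
    .expect("the query should be valid for this grammar");

    let mut cursor = QueryCursor::new();
    for m in cursor.matches(&query, tree.root_node(), source.as_bytes()) {
        for capture in m.captures {
            println!("{}", &source[capture.node.byte_range()]);
        }
    }
}
```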