# development
b
FYI @witty-crayon-22786 Eric's https://github.com/pantsbuild/pants/pull/16761 is prep-work for the lint change (https://github.com/pantsbuild/pants/pull/16735) because:
• Pylint receives all field sets and then needs to partition (by ICs/resolve)
• The ICs/resolve metadata is attached to each partition
• `lint` then batches the partitions and hands those batches with their metadata back to `pylint` for running
• Therefore the metadata must be valid for each batch in a partition, because the metadata isn't batched
So we need to make the partition metadata be performantly valid for all possible batches of the partition's FieldSets. That's the context behind the change, in case that informs your review thoughts. Eventually the pylint runner gets a bag of field_sets and static metadata applying to that-bag-or-bigger.
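To make that flow concrete, here is a minimal sketch with toy stand-in types (these are not the real Pants `FieldSet`/partition classes, and `partition_by_resolve`/`batch` are hypothetical names):

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator

# Toy stand-ins for illustration only; not the real Pants types.
@dataclass(frozen=True)
class FieldSet:
    address: str
    resolve: str  # stands in for the ICs/resolve key


@dataclass(frozen=True)
class Partition:
    field_sets: tuple[FieldSet, ...]
    metadata: str  # must be valid for *any* batch sliced from field_sets


def partition_by_resolve(field_sets: list[FieldSet]) -> list[Partition]:
    """The tool (e.g. pylint) groups all field sets by resolve and attaches metadata."""
    by_resolve: dict[str, list[FieldSet]] = {}
    for fs in field_sets:
        by_resolve.setdefault(fs.resolve, []).append(fs)
    return [Partition(tuple(fss), resolve) for resolve, fss in by_resolve.items()]


def batch(partition: Partition, size: int) -> Iterator[tuple[tuple[FieldSet, ...], str]]:
    """lint slices each partition into batches, carrying the partition-wide metadata along."""
    it = iter(partition.field_sets)
    while chunk := tuple(islice(it, size)):
        yield chunk, partition.metadata  # the metadata itself is never re-computed per batch
```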
w
yea, comment still applies. `lint` would want to partition `CoarsenedTargets` then, probably.
👍 1
b
Thanks for thinking through it 👍👍 much appreciated
w
at least assuming that `pylint` and `mypy` both need transitive deps, rather than just one level of deps, for example
b
... wait lol just double-read that. Can you elaborate? Can we not partition by ICs/resolve, then when running pylint request CoarsenedTargets on the field set batch? Or is that what you're suggesting?
w
Can we not partition by ICs/resolve, then when running pylint request CoarsenedTargets on the field set batch?
no: as mentioned in my comment, that would trigger #11270: https://github.com/pantsbuild/pants/pull/16761#discussion_r962063128
you want to compute CTs once, and then re-use them, rather than computing them in each caller (ditto not computing TTs in a bunch of spots)
if we fix #11270, it might be another story… but that’s lower in my queue than the docker/remote-exec stuff right now
no: as mentioned in my comment, that would trigger #11270
hm. so, i suppose i didn’t mention in there that #11270 applies to both CT and TT. the thing that #11270 is about is avoiding making lots of requests for CTs/TTs (i.e. a request per root or partition), if possible.
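Rough sketch of that performance point (the `get_coarsened_targets` helper is hypothetical, standing in for whatever rule actually computes CTs; this is not real Pants rule code):

```python
# Hypothetical helpers, for illustration only.

def lint_with_per_partition_requests(partitions, get_coarsened_targets):
    # One CT/TT request per partition: each call may trigger its own graph walk (#11270).
    return [run_tool(p, get_coarsened_targets(p.field_sets)) for p in partitions]


def lint_with_one_upfront_request(all_field_sets, partitions, get_coarsened_targets):
    # One CT/TT request up front; every partition (and batch) reuses/subsets the result.
    cts = get_coarsened_targets(all_field_sets)
    return [run_tool(p, cts) for p in partitions]


def run_tool(partition, coarsened_targets):
    ...  # build the sandbox from the (possibly over-broad) coarsened targets and run
```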
@bitter-ability-32190: anyway: happy to help unblock https://github.com/pantsbuild/pants/pull/16735 , but i don’t actually understand the blocker right now
b
I guess what we could do is attach the CT/TT for a partition with the partition, and in the "runner" construct the sandbox using a possibly-over-populated sandbox for the partition's CT/TT (even though we're linting a subset). And the reason we'd do it is because materializing a potentially-larger sandbox is cheaper than calculating CT/TT in each "runner". Do I have that right?
w
I guess what we could do is attach the CT/TT for a partition with the partition, and in the “runner” construct the sandbox using a possibly-over-populated sandbox for the partition’s CT/TT (even though we’re linting a subset).
it’s more that what you partition is CoarsenedTargets, not Targets.
so there shouldn’t be any over-large sandboxes
b
Well the result from the per-tool partitioning is then split by `lint.py` into "batches" which are subsets. So either we get the CT/TT for the "batch" in the runner, or we've passed a hint along from the partitioner (which would be maybe over-specified for the batch)
w
there might be a pluralization issue here i think: if you have a list of CTs (`CoarsenedTargets`), then you can partition it into one or more lists of CTs
or batch, or whatever word
i.e.
```python
def partition(cts: CoarsenedTargets) -> list[CoarsenedTargets]:
    ...
```
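One way to read that signature, with simplified stand-ins for the engine types (the `resolve` key here is just an example partitioning criterion):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarsenedTarget:          # simplified stand-in for the engine type
    members: tuple[str, ...]    # addresses in the coarsened component
    resolve: str                # hypothetical partition key


class CoarsenedTargets(tuple):  # a CT collection is itself just a sequence of CTs
    pass


def partition(cts: CoarsenedTargets) -> list[CoarsenedTargets]:
    """Split one collection of CTs into several (e.g. by resolve) without re-computing any CT."""
    by_key: dict[str, list[CoarsenedTarget]] = {}
    for ct in cts:
        by_key.setdefault(ct.resolve, []).append(ct)
    return [CoarsenedTargets(group) for group in by_key.values()]
```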
b
Example:
• `lint.py` finds 26 field sets: A..Z
• `lint.py` requests partitions from `pylint`
• `pylint` partitions the field sets by some criteria and attaches some metadata.
  ◦ Let's say it's by vowel/consonant
  ◦ So `pylint` returns back 2 partitions: `[A,E,I,O,U]+"vowel"` and `[<consonant>]+"consonant"`
• `lint.py` takes that and batches it for cache perf.
  ◦ let's say batch size is 5
  ◦ `lint.py` gives the `pylint` runner 6 _batches_:
    ▪︎ `[A,E,I,O,U]+"vowel"` + 5 batches of `[<consonant subset>]+"consonant"`
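Numerically, that walk-through checks out (a toy sketch; the `batches` helper is made up):

```python
from itertools import islice
from string import ascii_uppercase

field_sets = list(ascii_uppercase)                     # 26 field sets: A..Z
vowels = [fs for fs in field_sets if fs in "AEIOU"]
consonants = [fs for fs in field_sets if fs not in "AEIOU"]

# pylint returns 2 partitions, each with its metadata:
partitions = [(vowels, "vowel"), (consonants, "consonant")]


def batches(members, metadata, size=5):
    it = iter(members)
    while chunk := list(islice(it, size)):
        yield chunk, metadata


all_batches = [b for members, md in partitions for b in batches(members, md)]
# 1 vowel batch + 5 consonant batches (21 consonants split into chunks of <= 5) == 6 batches
assert len(all_batches) == 6
```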
w
`lint.py` requests partitions from `pylint`; `pylint` partitions the field sets by some criteria and attaches some metadata.
don’t partition field sets: partition CTs
and yea, this is annoying. fixing 11270 would make this less/un-necessary.
if you have time to dive in on that one i can try to sketch out more of the design. it’s going to involve some munging in `graph.rs` though
b
don’t partition field sets: partition CTs
In `pylint`? or in `lint`? I think I'm still off one stepping stone 🙂
w
both. to avoid 11270 you’d want to preserve the CTs all the way through
b
I don't think `lint.py` should be partitioning by `CT`s, no? That's wasteful for non-dependent linters
w
mm, yea. `pylint` being in `check` would make this easier, since all of those implementations require transitive deps.
b
That ship sailed 😛
w
think of it this way: you want to make as few calls to create `TTs`/`CTs` as you can. if you can’t create them at the very beginning of the process, then create them as early as you can
but it’s just about performance (avoiding 11270). it’s not less correct.
b
We'll be constructing the sandbox for each batch (subset of partition) using the full partition's TTs/CTs since we assume materializing that is cheaper perf-wise than re-computing
w
possibly. pylint could probably fix/subset the CTs it receives with the batch
but yea
i.e.
subset = [ct for ct in cts if any(m in my_batch for m in ct.members)]
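Spelled out with a toy stand-in type (here `my_batch` is just whatever addresses the runner received for this batch):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarsenedTarget:           # toy stand-in for the engine type
    members: tuple[str, ...]


def subset_for_batch(cts: list[CoarsenedTarget], my_batch: set[str]) -> list[CoarsenedTarget]:
    """Keep only the CTs that contain at least one member of this batch."""
    return [ct for ct in cts if any(m in my_batch for m in ct.members)]


cts = [CoarsenedTarget(("a", "b")), CoarsenedTarget(("c",)), CoarsenedTarget(("d", "e"))]
assert subset_for_batch(cts, {"a", "d"}) == [cts[0], cts[2]]
```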
b
booyah
So the takeaway here is:
• In Eric's https://github.com/pantsbuild/pants/pull/16761 the Partition object needs to hold the CTs
• In my PR (post-Eric's change) we attach the CTs as metadata to the partition, then in the runner use-and-subset it
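Putting the two bullets together, the shape might look roughly like this (hypothetical field and function names; the actual Partition type in those PRs may differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CT:                                  # toy stand-in for CoarsenedTarget
    members: tuple[str, ...]


@dataclass(frozen=True)
class Partition:
    field_sets: tuple[str, ...]            # what lint.py slices into batches
    metadata: str                          # tool-specific, e.g. the resolve
    coarsened_targets: tuple[CT, ...]      # computed once, for the whole partition


def run_batch(partition: Partition, batch: tuple[str, ...]) -> list[CT]:
    # The runner subsets the partition-wide CTs down to this batch,
    # then builds the sandbox / invokes the tool from that subset.
    return [
        ct for ct in partition.coarsened_targets
        if any(m in batch for m in ct.members)
    ]
```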
w
sounds right
b
🙏 thank you
Ah I see it now. We can make one call to `CoarsenedTargets` for all field sets, then subset that as necessary!
w
yea
b
Another piece of this I maybe should've brought up earlier is that the core `lint` code doesn't actually care that the partitions/batches are of field sets after the initial partitioning. The only reason it matters is the dataclass attribute name. I'm hoping to keep it that way, for sanity 🙂 (The alternative being to split a partition of CTs into batches of CTs)
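For what it's worth, a type-agnostic batch could be sketched as a small generic container, since `lint.py` only needs something it can slice and iterate (a hedged sketch, not the actual PR code):

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

_E = TypeVar("_E")  # element type: a FieldSet today, perhaps a CoarsenedTarget later


@dataclass(frozen=True)
class Batch(Generic[_E]):
    elements: tuple[_E, ...]   # lint.py only needs to slice and iterate these
    metadata: str


def make_batches(elements: tuple[_E, ...], metadata: str, size: int) -> list[Batch[_E]]:
    return [Batch(elements[i:i + size], metadata) for i in range(0, len(elements), size)]
```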
w
ah, didn’t realize you had also edited it
b
Oh I posted the comment just after I pushed 🙂
Oh hmmmmmmmmm, one issue with this I think is that if any target in the closure of the partition changes, it'll invalidate all batches, as the metadata attached is for the entire partition. I think I'm gonna change the PR to have batches be type-agnostic, then we can batch the `CoarsenedTargets`. Bueno?
h
Oh hmmmmmmmmm, one issue with this I think is that if any target in the closure of the partition changes, it'll invalidate all batches, as the metadata attached is for the entire partition.
that sounds right
I think I'm gonna change the PR to have batches be type-agnostic, then we can batch the CoarsenedTargets.
vs?
b
Right now it expects `field_sets`, however it isn't a type-restriction but just a matter of the name of the attribute (it made sense to do it this way and involved fewer refactors to keep the old name). Once `lint.py` gets a partition object, it no longer cares about the type past the fact that it can make batches (so an iterable). The batches are then passed back to the "runner" and results are collected
h
Sounds reasonable to me. Maybe land this as a prefactor?
b
I'm not sure that's possible 😞 It's all intertwined
Oh but maybe it would work as an annoying prefactor if we make the Request type a generic and change the attribute name 🙂
👍 1
So many incremental PRs 😵
OK tried it as a prefactor and it just didn't look right. Changing it up like that was really out-of-context 😕
Oh womp womp. `RequirementsPexRequest` does a transitive walk, as does `PythonSourceFilesRequest`. Knowing that, it makes me think because of memoization we might be better off sticking to TTs
w
it should be fine either way, in the sense that in the worst case, computing the CTs adds one additional graph walk
memoization is good, but when the inputs are arbitrary like this, you can’t always be sure that those lookups aren’t going to be doing any filtering, etc.
11270 will eventually make arbitrary lookups cheaper, but not in the next few weeks.
b
The implementation certainly is more transparent with TT, since we slice FieldSets
We're no worse off (I suspect slightly better) than 2.14 even using TTs
Turned out to be easier using CTs 🙂
👍 1