# development
h
With the engine, we've now accumulated a couple of pair types like `AddressWithOrigin` and `TargetWithOrigin`. We're using dataclasses to represent these. I'd like to instead use `typing.NamedTuple`, which has the exact same `class` definition as a dataclass, including allowing us to define methods, but is more performant and allows us to unpack the pair:
```python
target, origin = target_with_origin
```
But the one downside is that `TargetWithOrigin(tgt, origin) == (tgt, origin) == DistinctNamedTuple(tgt, origin)`. (At least `TargetWithOrigin != tuple != DistinctNamedTuple`.) I don't think this is actually an issue? We don't care if `TargetWithOrigin(tgt, origin) == (tgt, origin)` occurs, and because the engine calls `type()` on instances to use exact type IDs, we would fail if a rule author forgot to wrap the tuple in its corresponding NamedTuple?
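A minimal sketch of what's being proposed, with illustrative string fields standing in for the real types; it shows both the unpacking win and the equality caveat, plus the exact-type check the engine relies on:

```python
from typing import NamedTuple

class TargetWithOrigin(NamedTuple):
    target: str  # stand-in for the real Target type
    origin: str  # stand-in for the real OriginSpec type

    # NamedTuple classes can still define methods, like a dataclass.
    def describe(self) -> str:
        return f"{self.target} (from {self.origin})"

class DistinctNamedTuple(NamedTuple):
    target: str
    origin: str

two = TargetWithOrigin("//:lib", "cli-spec")

# The win: tuple unpacking.
target, origin = two

# The caveat: equality is structural, just like a plain tuple.
assert two == ("//:lib", "cli-spec")
assert two == DistinctNamedTuple("//:lib", "cli-spec")

# But exact types still differ, which is what the engine keys on.
assert type(two) is not tuple
assert type(two) is not DistinctNamedTuple
```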
a
i would really prefer if the class name was required to destructure, like rust tuple structs, and we may want to make that a convention
otherwise i don't see a problem, and i've had a TODO in my head to allow destructuring via tuples for a while
👖 1
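A hypothetical sketch of that convention: since Python's tuple unpacking never names the class, one option is a small helper that forces the class name to appear at the destructuring site, a rough analogue of Rust tuple structs. None of these names exist in Pants; they are purely illustrative.

```python
from typing import NamedTuple, Type, TypeVar

class AddressWithOrigin(NamedTuple):
    address: str  # illustrative; the real field types differ
    origin: str

_T = TypeVar("_T", bound=tuple)

def destructure(expected_type: Type[_T], value: _T) -> _T:
    """Return `value` only if it is exactly `expected_type`, so the class
    name must be spelled out at the unpacking site."""
    if type(value) is not expected_type:
        raise TypeError(
            f"expected {expected_type.__name__}, got {type(value).__name__}"
        )
    return value

awo = AddressWithOrigin("//:lib", "cli-spec")
address, origin = destructure(AddressWithOrigin, awo)
```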
h
> i would really prefer if the class name was required to destructure,
I don't think we generally want that for complex types: see https://www.python.org/dev/peps/pep-0557/#why-not-just-use-namedtuple for why this makes adding new fields via inheritance tricky. But for pair types like `XWithY`, we should never be adding a 3rd field, so it would be super nice to have.
a
so this would be an awesome change
🔥 1
> Instances are always iterable, which can make it difficult to add fields.
i don't like this part
oh!!!
that's for `namedtuple()`
👍 1
h
Yes, the PEP is explaining how allowing you to unpack something like a `NamedTuple` means it's much more difficult to add a new field to the class than before, because you break all call sites that now need to unwrap that new field
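A small illustration of that point with made-up types: once call sites unpack positionally, adding a field changes the arity and breaks them.

```python
from typing import NamedTuple

class Pair(NamedTuple):
    address: str
    origin: str

# Positional unpacking works while there are exactly two fields...
address, origin = Pair("//:lib", "cli-spec")

class PairWithMetadata(NamedTuple):
    address: str
    origin: str
    metadata: str  # the newly added field

# ...but the same call-site pattern now fails.
try:
    address, origin = PairWithMetadata("//:lib", "cli-spec", "extra")
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 2)
```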
I think dataclasses are working well for most engine types, e.g. `Target` and `StripTargetRequest`. One nice feature is having custom constructors that can be frozen afterwards (thanks again for that decorator!). Where we can optimize is the pair types we have, like `AddressWithOrigin`
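The decorator mentioned here is Pants' own; below is only a minimal sketch of the same idea (let a custom `__init__` assign attributes, then freeze the instance), not the real implementation, and the decorator name and `StripTargetRequest`'s field are chosen purely for illustration.

```python
from dataclasses import dataclass

def frozen_after_init(cls):
    """Sketch: allow assignment inside __init__, then freeze the instance."""
    prev_init = cls.__init__

    def __init__(self, *args, **kwargs):
        prev_init(self, *args, **kwargs)
        object.__setattr__(self, "_is_frozen", True)

    def __setattr__(self, key, value):
        if getattr(self, "_is_frozen", False):
            raise AttributeError(f"{cls.__name__} is frozen; cannot set {key!r}")
        object.__setattr__(self, key, value)

    cls.__init__ = __init__
    cls.__setattr__ = __setattr__
    return cls

@frozen_after_init
@dataclass
class StripTargetRequest:
    target_name: str  # invented field, for illustration only

    def __init__(self, target_name: str) -> None:
        # The custom constructor can normalize input before the instance freezes.
        self.target_name = target_name.strip()

req = StripTargetRequest("  //:lib  ")
assert req.target_name == "//:lib"
# req.target_name = "other"  # would raise AttributeError
```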
a
i don't know how i feel about that. changing every call site feels like something that's very hard to automate and that i'd prefer not to do. if it was possible to destructure while not knowing exactly how the type is laid out (like with rust's `..`), then this seems like it'd be a lot more useful
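For what it's worth, Python's extended unpacking is the closest built-in analogue of Rust's `..`: a call site can bind only the fields it cares about without spelling out the full layout. The type below is invented for illustration.

```python
from typing import NamedTuple

class Wide(NamedTuple):
    first: int
    second: int
    third: int
    fourth: int

value = Wide(1, 2, 3, 4)

first, *_rest = value       # bind the first field, ignore the rest
*_ignored, fourth = value   # or only the last one
```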
h
> i don't know how i feel about that.
Feel about what? About using `NamedTuple` with pair types like `AddressWithOrigin`? They can keep using named properties like `address_with_origin.address` if they want! No need to rewrite those call sites. Only, now, there's an additional technique they can use to destructure via tuple unpacking
a
ok!
w
haven't read the rest of the thread, but: those pair classes could almost certainly be avoided with the "multiple Params to a Get" change
if so... would be good to do that rather than continue to create or optimize them
a
oh ugh i have a branch for that
not even close to working yet
w
if this is a pressing need, i can take a look at it today/tomorrow rather than doing async/await stuff
h
Not a pressing need at all, imo. Multiple params is solely boilerplate reduction, which is valuable, but I think Rust async/await would be more impactful. I'm still pretty unclear, too, how multiple params would work with collection types. It's a common idiom for a goal rule to request `TargetsWithOrigins` and `AddressesWithOrigins`. What would that look like with multiple params?
w
oh... unclear.
h
That's part of the reason I don't see a strong benefit to being able to have the equivalent of `AddressWithOrigin` via multiple params. `NamedTuple`s are exceptionally lightweight (Py 3.9 made an optimization where their attribute access is apparently the fastest in Python). It's also a very nice construct for programmers: they can easily wrap their head around having "both an Address and an OriginSpec". It's a plain and simple Python idiom with no new engine magic.
Another consideration: for `AddressesWithOrigins` and `TargetsWithOrigins`, we must preserve the method `.expect_single()`.
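A rough sketch (not the real Pants class) of the kind of tuple-backed collection that carries `.expect_single()`; the element values here are placeholders.

```python
class AddressesWithOrigins(tuple):
    """Sketch of a tuple subclass exposing the .expect_single() contract."""

    def expect_single(self):
        if len(self) != 1:
            raise ValueError(
                f"Expected exactly one element, but found {len(self)}: {list(self)}"
            )
        return self[0]

# Callers that assume a single result get a loud failure otherwise.
single = AddressesWithOrigins([("//:lib", "cli-spec")]).expect_single()
```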
w
both of those are computed from some other structure though, yea?
a dict?
not sure here... but collection classes like that generally evolve into something purpose-specific. for example: "Addresses" currently implicitly means "all Addresses for the root Specs"
likewise each of those other collections.
so i don't know how they'll evolve over time, since it's mostly just goal_rules that should be requesting "the world" like that.
and there is always `Addresses`, `Targets`, and `Origins`
h
No, see https://github.com/pantsbuild/pants/blob/eeb9d7488a39bb428bf19e273584675d1041745b/src/python/pants/engine/legacy/graph.py#L836-L890 and https://github.com/pantsbuild/pants/blob/eeb9d7488a39bb428bf19e273584675d1041745b/src/python/pants/engine/build_files.py#L213-L267. We go out of our way to preserve the `OriginSpec` at the time of first parsing the `Addresses`. We must preserve that information when we first encounter it, or we will have no way of backtracking "What command line spec did `:lib` come from?"
`AddressesWithOrigins` is the root of it all. From there, we can get `Addresses` by stripping off the origins, for example, or convert it to `TargetsWithOrigins`
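A simplified sketch of that flow, using stand-in types and a plain function where Pants would use a rule: `AddressesWithOrigins` is resolved first, and the other forms are derived from it by dropping or joining information.

```python
from typing import NamedTuple

class Address(NamedTuple):
    spec: str

class OriginSpec(NamedTuple):
    raw_spec: str  # e.g. the command-line spec that ":lib" came from

class AddressWithOrigin(NamedTuple):
    address: Address
    origin: OriginSpec

class Addresses(tuple):
    pass

class AddressesWithOrigins(tuple):
    pass

def strip_origins(addresses_with_origins: AddressesWithOrigins) -> Addresses:
    # Dropping origins is easy; recovering them later would not be, which is
    # why they are captured when the addresses are first parsed.
    return Addresses(awo.address for awo in addresses_with_origins)
```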
w
where origins was a dict, and you could look up the origin for an Address or Target via address
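A hypothetical sketch of that alternative (none of these names are real Pants types): a single `Origins` mapping keyed by address, looked up on demand instead of carrying pair types around.

```python
from typing import Dict, NamedTuple

class Address(NamedTuple):
    spec: str

class OriginSpec(NamedTuple):
    raw_spec: str

class Origins:
    """Hypothetical mapping from each Address to the OriginSpec it came from."""

    def __init__(self, mapping: Dict[Address, OriginSpec]) -> None:
        self._mapping = dict(mapping)

    def lookup(self, address: Address) -> OriginSpec:
        return self._mapping[address]

origins = Origins({Address("//:lib"): OriginSpec("src/python/::")})
origin = origins.lookup(Address("//:lib"))
```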
h
> where origins was a dict, and you could look up the origin for an Address or Target via address
We probably could do that... but I don't see the benefit? There is a tight coupling between an `Address` and its `OriginSpec`: every `Address` has precisely one `OriginSpec`. Right now, we preserve that information when we first resolve the addresses. Throwing it away and replacing it with this `Origins` singleton could work, but seems roundabout
w
it involves fewer classes that are just compositions of other classes
it eliminates `TargetsWithOrigins`, in particular.
h
> it involves fewer classes that are just compositions of other classes
At the expense of more magical singletons and rules needing to request both `Targets` and `Origins`, rather than simply `TargetsWithOrigins`
w
fewer, right?
`Addresses` already exists.
h
Huh? Yes, it does and will always exist. But you're proposing adding a new `Origins` type
w
the net number of classes is fewer then, i think. because it would add `Origins`, but remove both `AddressesWithOrigins` and `TargetsWithOrigins`
anyway, not sure.
it's the difference between a function taking two parameters and one merged parameter
except that here, in the one merged parameter case you need a wrapper class
h
Indeed, 1 fewer class. But now every call site must not only take 2 parameters, but also, when iterating over every target, have boilerplate to reassociate that target with its `OriginSpec`, vs. that association already being done for you
> in the one merged parameter case you need a wrapper class
Keep in mind that if you don't care about `WithOrigins`, you can still request the simpler `Addresses`. You only request `AddressesWithOrigins` when you have something meaningful to do with the origin specs.
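To make the trade-off concrete, here is a sketch of the two call-site shapes being argued about, reusing the illustrative types from the earlier sketches rather than real engine signatures.

```python
# Two-parameter style: every call site re-joins targets with their origins.
def run_two_params(targets, origins):
    for target in targets:
        origin = origins.lookup(target.address)
        ...  # do something with (target, origin)

# Merged-parameter style: the association is already materialized.
def run_merged(targets_with_origins):
    for target, origin in targets_with_origins:
        ...  # do something with (target, origin)
```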
w
i don't know what the use sites look like... is it the case that you need the origin in all cases? or only in error cases?
h
You need it for `lint`, `fmt`, and `test`, so that we can calculate the precise files to run on, e.g. `./v2 test foo.py`. No other goals use it; they simply request `Addresses`/`Targets`
w
k
h
> or only in error cases?
We never use the `OriginSpec` for errors, at the moment. This feature is solely used for precise file arguments, which is something rule authors have to go out of their way to opt into
w
to head back to the rest of the thread: regarding `dataclass`: `datatype` used to extend `namedtuple`
👍 1
my understanding when we moved to dataclass was that it was similarly performant to namedtuple
a
yes, that was mine too
w
is that not the case? i don't think we noticed a performance difference, but it might have gotten lost in the python3 speedup.
h
> my understanding when we moved to dataclass was that it was similarly performant to namedtuple
Yes, dataclass still has very good, optimized performance. But NamedTuple is even more lightweight, because it's doing fewer things than a fully-fledged class
w
is there a reference you've seen on that?
h
For example, you might have seen how you can use `__slots__` to reduce a class's memory footprint. Dataclasses don't do that by default (they considered it) because it makes things much more complex. In contrast, because NamedTuple is just a wrapper around a tuple, it (I believe) uses that performance hack
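A quick way to see the difference, with invented pair types: a plain dataclass instance stores its fields in a per-instance `__dict__`, while a NamedTuple instance stores them in the tuple itself. Exact numbers vary by Python version, so treat this as a sketch rather than a benchmark.

```python
import sys
from dataclasses import dataclass
from typing import NamedTuple

@dataclass(frozen=True)
class PairDataclass:
    address: str
    origin: str

class PairNamedTuple(NamedTuple):
    address: str
    origin: str

d = PairDataclass("//:lib", "cli-spec")
n = PairNamedTuple("//:lib", "cli-spec")

# The dataclass instance carries a separate __dict__ for its fields;
# the NamedTuple's fields live directly in the tuple.
print(sys.getsizeof(d), sys.getsizeof(d.__dict__))
print(sys.getsizeof(n))
```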
w
and a bit smaller
...which is surprising.
see the Comparison table
h
That it is... Although, it looks like they don't test `frozen=True`, which is known to slightly reduce performance: https://docs.python.org/3/library/dataclasses.html#frozen-instances
I mean, fwiw, performance isn't the real reason I want `NamedTuple` for `TargetWithOrigin` and `AddressWithOrigin`. I suspected it would be better, so I threw it out there as an additional carrot. What I really want is tuple unpacking:
```python
tgt, origin = target_with_origin
```
w
so, related to this, it looks like implementing `Addresses` via `AddressesWithOrigins` causes about a 10-15% slowdown for `list` in `1.26.x`
which ends up noticeable in CI times
i'll take a swing at computing them lazily tomorrow
👍 1
h
Wow. This is using address specs, correct?
Why 1.26.x? We made that change in 1.25.x iirc
w
[attachments: 1.25.x.svg, 1.26.x.svg]
doesn't look like it
h
Ohh you’re right. 1.25.x added file args, 1.26.x added precise file args