# development
h
this is in the context of trying to close out https://github.com/pantsbuild/pants/issues/7969
w
i think we had discussed it being the "first requesting Node". cc @average-vr-56795, @dry-policeman-7927
a
Yeah, I think that sounds about right... I think there was also a discussion of changing how we generate zipkin spans to be more annotation-based and less span-heavy, but that's probably not relevant for you here :)
h
what does it mean to be the first requesting Node, in terms of data structures available in the `run` method of `NodeKey`?
w
Would need to do it above that, in the Graph
In the Graph, you know whether a Node/entry is being "created" vs already existing
..whoops. didn't mean to send to channel
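A toy illustration of the point above (hypothetical names, not the actual pants graph code): only the first requester of a node observes the entry being created, so that requester is the natural choice of parent.

```rust
use std::collections::HashMap;

// Toy model: the first time a node is requested, its entry is created and the
// requesting node is recorded as the parent; later requesters find the entry
// already present and do not change it.
#[derive(Default)]
struct ToyGraph {
    first_requester: HashMap<&'static str, &'static str>,
}

impl ToyGraph {
    fn get(&mut self, src: &'static str, dst: &'static str) {
        self.first_requester.entry(dst).or_insert(src);
    }
}

fn main() {
    let mut g = ToyGraph::default();
    g.get("test_a", "resolve_3rdparty"); // entry created: test_a becomes the parent
    g.get("test_b", "resolve_3rdparty"); // entry already exists: no change
    assert_eq!(g.first_requester["resolve_3rdparty"], "test_a");
}
```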
h
so maybe it doesn't make sense to create workunits in the `run` Node trait implementation at all
w
Possibly. I don't know if we need any information that is only present there
But yea: if the codepath in run doesn't use any local info, it could probably move elsewhere
a
IIRC the code moved to run so that we could get timing information at the start/end of the run, but no reason parent tracking and timing tracking need to be in the same place
h
@average-vr-56795 where was it originally?
w
You can still get start/end time in Graph::get. BUT, i think that maybe one issue might be that Graph is generic. Might need to introduce the workunitstore trait there.
a
Not sure, it may have only existed elsewhere before in review :) shouldn't have bearing on you either way
h
@witty-crayon-22786 @average-vr-56795 is there any reason why we can't re-use the concept of an `EntryId` that's available per-node in `graph/src/entry.rs` for workunit span_id?
i.e. are those actually two different concepts, or just the same concept that happens to be instantiated two separate places right now?
w
The "spanid" concept is one from distributed tracing, where each span may be generated on a different machine without copliding
While it wouldn't collide locally to use the entry id as spanid, it might on the context of remote execution, for example
h
ah that makes sense
so we should still generate them randomly from a large pool
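For concreteness, a minimal sketch of random span-id generation (hypothetical helper name; the real code may size and format ids differently), using the `rand` crate:

```rust
// A random 64-bit value rendered as hex: effectively collision-free both
// locally and across machines, unlike reusing a locally-assigned EntryId.
fn generate_span_id() -> String {
    format!("{:016x}", rand::random::<u64>())
}
```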
as far as I can tell there's no reason why we can't initialize a workunit somewhere around here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L253
and then finalize it with the current time and add it to the workunit store somewhere in here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L403
other than the work of making the `graph` trait aware of `workunit_store`
and then `run` itself wouldn't need to care about workunits
and I think we would need some kind of method on `Graph` to ask of a node, what its parent is
unless this already exists
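One possible shape for making the generic graph crate aware of workunit recording without depending on the concrete store (a hedged sketch with hypothetical names, not an actual pants trait):

```rust
use std::time::Duration;

// Hypothetical: a small trait that the graph's context type could implement,
// so entry.rs can report start/finish information without the generic graph
// crate knowing the concrete WorkunitStore type.
pub trait WorkunitRecorder {
    fn record_workunit(
        &self,
        name: &str,
        duration: Duration,
        parent_span_id: Option<&str>,
    );
}
```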
it is correct to say that a node only has a single parent but multiple children, right? that's the "direction" the graph assumes?
w
No: it's a DAG
So a Node may have multiple parents: hence the question at the head of the thread
h
when conceptually does it happen that a Node has multiple parents?
say an EPR node to be concrete
w
If two tests have the same 3rdparty deps
They'll invoke pex with the same args, which is "the same" epr.
Or if two rules recurse and depend on the same inner node (maybe two libraries depend on the same library)
h
ah yeah, so the first requesting node will be the one that actually invokes `run` on that node
and the second one will see that node cached and won't enter the code path in `entry.rs` where `run` gets called
w
Sort of. The Graph always invokes run in a new async Task. It's more like a trampoline.
a
IIRC if we can do timing measuring in one function, rather than in different functions, that's useful, because the executor may insert arbitrary delays between futuresy functions
h
so maybe the architecture we want is that `Node::run` takes in an `Option<StartedWorkunit>` argument (just like it now accepts a `Context`), that has the parent information from the graph
and then `run` computes the total run time, and adds that to `StartedWorkunit` to make a `WorkUnit`, and that's what gets added to the store?
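A hedged sketch of the shape being proposed (hypothetical types and fields, not the actual pants code): the graph, which knows the parent, constructs the started workunit; `run` finishes it, and the completed workunit is what lands in the store.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the types discussed above.
struct StartedWorkUnit {
    name: String,
    span_id: String,
    parent_span_id: Option<String>,
    start: Instant,
}

struct WorkUnit {
    name: String,
    span_id: String,
    parent_span_id: Option<String>,
    duration: Duration,
}

impl StartedWorkUnit {
    // Finalizing captures the total run time.
    fn finish(self) -> WorkUnit {
        WorkUnit {
            name: self.name,
            span_id: self.span_id,
            parent_span_id: self.parent_span_id,
            duration: self.start.elapsed(),
        }
    }
}

// Sketch of the proposed `run` shape: it receives the already-started workunit
// (if any), does its work, and returns the completed workunit alongside the
// result so the caller can add it to the store.
fn run(started: Option<StartedWorkUnit>) -> (u64, Option<WorkUnit>) {
    let result = 42; // stand-in for the node's real work
    (result, started.map(StartedWorkUnit::finish))
}

fn main() {
    let started = StartedWorkUnit {
        name: "example".to_string(),
        span_id: "abc123".to_string(),
        parent_span_id: None,
        start: Instant::now(),
    };
    let (_result, workunit) = run(Some(started));
    println!("recorded: {:?}", workunit.map(|w| w.duration));
}
```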
w
The total runtime can be computed anywhere
Because it can be attached to the resulting future.
as long as it's in the codepath that created the node.
h
yeah as long as the `run` method returns its start and end time, we can compute that wherever
but it's the start and end real-time of `run` that is the span of time we actually care about tracking, right?
w
No. All of this is async
Recording in Graph::get is the same as recording in Entry::run
h
right now we're recording in Node::run which isn't quite the same as Entry::run
w
Sure, point stands
h
I'm not sure where Graph::get gets called
w
It's how nodes are created
```rust
let start = Instant::now();
doX().map(move |result| { record(start.elapsed()); result })
```
Is how you time async code
But because it's async, recording inside doX is basically equivalent to outside
Because actually calling the method takes ~no time
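A self-contained version of that pattern (using the `futures` 0.3 API purely for illustration; `do_x` is a placeholder for the `doX` above):

```rust
use std::time::Instant;

use futures::FutureExt;

// Placeholder for any async computation the graph kicks off.
async fn do_x() -> u64 {
    42
}

fn main() {
    let start = Instant::now();
    // Recording in a closure attached to the future captures the full
    // wall-clock span of the async work, including time the task spends
    // suspended; attaching this closure inside do_x or outside it makes
    // essentially no difference, which is the point being made above.
    let fut = do_x().map(move |result| {
        println!("result={result}, took {:?}", start.elapsed());
        result
    });
    futures::executor::block_on(fut);
}
```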
h
ah okay yeah, when I said "inside `Node::run`" earlier, I meant within the lexical body of that function; but that function is using `.inspect` to make sure that one of the calculations happens in a future that is run after the future that does the body of the work
so yeah, putting something like that `inspect` future outside of `Node::run` wouldn't make much of a difference
a
What I was saying was that ideally we do it in run, not in a map or inspect, because of potential delay. In an ideal world that delay should be trivial, but in reality it may not be.
h
@average-vr-56795 so what you're saying is that the current implementation that computes the time span in `nodes.rs` might be subtly wrong?
w
@average-vr-56795 unless there is blocking code somewhere, it will be nanoseconds
a
Yes, I am not confident that we have no blocking code :)
h
if we do have blocking code, isn't that part of the time that we're legitimately trying to measure?
w
Yea, arguably.
a
Blocking code in a different Future
w
@average-vr-56795: no, that would not be recorded here
Or, that would not be recorded "falsely" here.
a
The problem is that if some other Future schedules on the executor between the thing we're trying to measure, and the closure where we actually measure it, that blocking time would be added to the timing of this Future
w
It's async code: that's always the case, no matter where you measure
a
Not if it's all within the one function you're actually measuring :)
w
Even then.
Because you are recording the runtime of an async task that will intentionally pause N or more times
h
the Node::run function calls one of several futures within itself
and any actual task might do arbitrary asynchronous stuff
a
How? What causes the single thread's call stack to get pre-empted?
h
which means that the same task could take more or less time on different runs depending on what else was going on with the scheduler at that time
w
@average-vr-56795: there are 400k tasks running: they will suspend to request other work
It's a cooperative scheduler: the tasks pause themselves to wait for other things.
Anyway... IMO, it does not matter from an accuracy perspective whether Graph::get or Node::run. It's more about which allows you to record what you need.
h
it looks like Graph::get gets called from Context::get
and that gets called from the `Node::run` implementation of `Select`
possibly other places too
what's the difference between `graph` and `rule_graph` on `Core`?
i.e. is the "dst_node" in Graph::get the same as the notion of the "first requesting node" we were talking about earlier?
w
src_node is the requesting node
@average-vr-56795: thoughts on whether this whole thing should use Task-locals to pass this information along? Or tokio-tracing directly?
a
Seems like a heavyweight thing to introduce tokio-tracing for, but if we're going to start using it regardless, no strong thoughts either way :)
w
Yea, I'm fairly sure that this has been accomplished using locals in other implementations (certainly in finagle, probably in tokio-tracing).
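For reference, a sketch of what the task-local approach could look like, using present-day tokio's `task_local!` macro purely as an illustration of the idea (the names and the choice of tokio API are assumptions, not the pants implementation):

```rust
use tokio::task_local;

task_local! {
    // Hypothetical: the span id of the workunit that requested the current node.
    static PARENT_SPAN_ID: String;
}

async fn run_node() {
    // Any code running inside the scope below can read the parent span id
    // without it being threaded through every function signature.
    let parent = PARENT_SPAN_ID.with(|id| id.clone());
    println!("parent span: {parent}");
}

#[tokio::main]
async fn main() {
    PARENT_SPAN_ID
        .scope("abc123".to_string(), run_node())
        .await;
}
```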