https://pantsbuild.org/ logo
#development
Title
# development
h

hundreds-breakfast-49010

01/15/2020, 10:05 PM
this is in the context of trying to close out https://github.com/pantsbuild/pants/issues/7969
w

witty-crayon-22786

01/15/2020, 11:09 PM
i think we had discussed it being the "first requesting Node". cc @average-vr-56795, @dry-policeman-7927
a

average-vr-56795

01/16/2020, 1:02 AM
Yeah, I think that sounds about right... I think there was also a discussion of changing how we generate zipkin spans to be more annotation-based and less span-heavy, but that's probably not relevant for you here :)
h

hundreds-breakfast-49010

01/16/2020, 1:34 AM
what does it mean to be the first requesting Node, in terms of data structures available in the run method of
NodeKey
?
w

witty-crayon-22786

01/16/2020, 1:35 AM
Would need to do it above that, in the Graph
In the Graph, you know whether a Node/entry is being "created" vs already existing
..whoops. didn't mean to send to channel
h

hundreds-breakfast-49010

01/16/2020, 1:36 AM
so maybe it doesn't make sense to create workunits in the
run
Node trait implementation at all
w

witty-crayon-22786

01/16/2020, 1:40 AM
Possibly. I don't know if we need any information that is only present there
But yea: if the codepath in run doesn't use any local info, it could probably move elsewhere
a

average-vr-56795

01/16/2020, 2:02 AM
IIRC the code moved to run so that we could get timing information at the start/end of the run, but no reason parent tracking and timing tracking need to be in the same place
h

hundreds-breakfast-49010

01/16/2020, 2:05 AM
@average-vr-56795 where was it originally?
w

witty-crayon-22786

01/16/2020, 2:23 AM
You can still get start/end time in Graph::get. BUT, i think that maybe one issue night be that Graph is generic. Might need to introduce the workunitstore trait there.
a

average-vr-56795

01/16/2020, 3:06 AM
Not sure, it may have only existed elsewhere before in review :) shouldn't have bearing on you either way
h

hundreds-breakfast-49010

01/16/2020, 11:33 PM
@witty-crayon-22786 @average-vr-56795 is there any reason why we can't re-use the concept of an
EntryId
that's available per-node in
graph/src/entry.rs
for workunit span_id?
i.e. are those actually two different concepts, or just the same concept that happens to be instantiated two separate places right now?
w

witty-crayon-22786

01/16/2020, 11:34 PM
The "spanid" concept is one from distributed tracing, where each span may be generated on a different machine without copliding
While it wouldn't collide locally to use the entry id as spanid, it might on the context of remote execution, for example
h

hundreds-breakfast-49010

01/17/2020, 12:05 AM
ah that makes sense
so we should still generate them randomly from a large pool
as far as I can tell there's no reason why we can't initialize a workunit somewhere around here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L253
and the finalize it with the current time and add it to the workunit store somewhere in here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L403
other than the work of making the
graph
trait aware of
workunit_store
and then
run
itself wouldn't need to care about workunits
and I think we would need some kind of method on
Graph
to ask of a node, what its parent is
unless this already exists
it is correct to say that a node only has a single parent but multiple children, right? that's the "direction" the graph assumes?
w

witty-crayon-22786

01/17/2020, 12:17 AM
No: it's a DAG
So a Node may have multiple parents: hence the question at the head of the thread
h

hundreds-breakfast-49010

01/17/2020, 12:20 AM
when conceptually does it happen that a Node has multiple parents?
say an EPR node to be concrete
w

witty-crayon-22786

01/17/2020, 12:21 AM
If two tests have the same 3rdparty deps
They'll invoke pex with the same args, which is "the same" epr.
Or if two rules recurse and depend on the same inner node (maybe two libraries depend on the same library)
h

hundreds-breakfast-49010

01/17/2020, 12:22 AM
ah yeah, so the first requesting node will be the one that actually invokes
run
on that node
and the second one will see that node cached and won't enter the code path in
<http://entry.rs|entry.rs>
where
run
gets called
w

witty-crayon-22786

01/17/2020, 12:23 AM
Sortof. The Graph always invokes run in a new async Task. It's more like a trampoline.
a

average-vr-56795

01/17/2020, 12:24 AM
IIRC if we can do timing measuring in one function, rather than in different functions, that's useful, because the executor may insert arbitrary delays between futuresy functions
h

hundreds-breakfast-49010

01/17/2020, 12:25 AM
so maybe the architecture we want is that
Node::run
takes in a
Option<StartedWorkunit>
argument (just like it now accepts a
Context
), that has the parent information from the graph
and then
run
computes the total run time, and adds that to
StartedWorkunit
to make a
WorkUnit
and that's what gets added to the store
?
w

witty-crayon-22786

01/17/2020, 12:26 AM
The total runtime can be computed anywhere
Because it can be attached to the resulting future.
if it's in the codepath that created the node.
h

hundreds-breakfast-49010

01/17/2020, 12:27 AM
yeah as long as the
run
method returns its start and end time, we can compute that wherever
but it's the start and end real-time of
run
that is the span of time we actually care about tracking, right?
w

witty-crayon-22786

01/17/2020, 12:27 AM
No. All of this is async
Recording in Graph::get is the same as recording in Entry::run
h

hundreds-breakfast-49010

01/17/2020, 12:29 AM
right now we're recording in Node::run which isn't quite the same as Entry::run
w

witty-crayon-22786

01/17/2020, 12:29 AM
Sure, point stands
h

hundreds-breakfast-49010

01/17/2020, 12:29 AM
I'm not sure where Graph::get gets called
w

witty-crayon-22786

01/17/2020, 12:29 AM
It's how nodes are created
Copy code
let start = ...
doX().map(|..| { record(start - end); ...})
Is how you time async code
But because it's async, recording inside doX is basically equivalent to outside
Because actually calling the method takes ~no time
h

hundreds-breakfast-49010

01/17/2020, 12:34 AM
ah okay yeah, when I said "inside
Node::run
earlier, I mean within the lexical body of that function; but that function is using
.inspect
to make sure that one of teh calculations happens in a future that is run after the future that does the body of the work
so yeah putting something like that
inspect
future outside of
Node::run
wouldn't make much of a difference
a

average-vr-56795

01/17/2020, 12:42 AM
What I was saying was that ideally we do it in run, not in a map or inspect, because of potential delay. In an ideal world that delay should be trivial, but in reality it may not be.
h

hundreds-breakfast-49010

01/17/2020, 12:47 AM
@average-vr-56795 so what you're saying is that the current implementation that computes the time span in
<http://nodes.rs|nodes.rs>
might be subtly wrong?
w

witty-crayon-22786

01/17/2020, 12:49 AM
@average-vr-56795 unless there is blocking code somewhere, it will be nanoseconds
a

average-vr-56795

01/17/2020, 12:50 AM
Yes, I am not confident that we have no blocking code :)
h

hundreds-breakfast-49010

01/17/2020, 12:50 AM
if we do have blocking code, isn't that part of the time that we're legitimately trying to measure?
w

witty-crayon-22786

01/17/2020, 12:51 AM
Yea, arguably.
a

average-vr-56795

01/17/2020, 12:51 AM
Blocking code in a different Future
w

witty-crayon-22786

01/17/2020, 12:51 AM
@average-vr-56795: no, that would not be recorded here
Or, that would not be recorded "falsely" here.
a

average-vr-56795

01/17/2020, 12:52 AM
The problem is that if some other Future schedules on the executor between the thing we're trying to measure, and the closure where we actually measure it, that blocking time would be added to the timing of this Future
w

witty-crayon-22786

01/17/2020, 12:52 AM
It's async code: that's always the case, no matter where you measure
a

average-vr-56795

01/17/2020, 12:53 AM
Not if it's all within the one function you're actually measuring :)
w

witty-crayon-22786

01/17/2020, 12:53 AM
Even then.
Because you are recording the runtime of an async task that will intentionally pause N or more times
h

hundreds-breakfast-49010

01/17/2020, 12:54 AM
the Node::run function calls one of several futures within itself
and any actual task might do arbitrary asynchronous stuff
a

average-vr-56795

01/17/2020, 12:54 AM
How? What causes the single thread's call stack get pre-empted?
h

hundreds-breakfast-49010

01/17/2020, 12:54 AM
which means that the same task could take more or less time on different runs dependingon what else was going on with the scheduler at that time
w

witty-crayon-22786

01/17/2020, 12:54 AM
@average-vr-56795: there are 400k tasks running: they will suspend to request other work
It's a cooperative scheduler: the tasks pause themselves to wait for other things.
Anyway... IMO, it does not matter from an accuracy perspective whether Graph::get or Node::run. It's more about which allows you to record what you need.
h

hundreds-breakfast-49010

01/17/2020, 12:57 AM
it looks like Graph::get gets called from Context::get
and that get called from the
Node::run
implementation of
Select
possibly other places too
what's the difference between
graph
and
rule_graph
on
Core
?
i.e. is the "dst_node" in Graph::get the same as the notion of the "first requesting node" we were talking about earlier?
w

witty-crayon-22786

01/17/2020, 2:49 AM
src_node is the requesting node
@average-vr-56795: thoughts on whether this whole thing should use Task-locals to pass this information along? Or tokio-tracing directly?
a

average-vr-56795

01/17/2020, 8:37 PM
Seems like a heavyweight thing to introduce tokio-tracing for, but if we're going to start using it regardless, no strong thoughts either way :)
w

witty-crayon-22786

01/17/2020, 8:39 PM
Yea, I'm fairly sure that this has been accomplished using locals in other implementations (certainly in finagle, probably in tokio-tracing).