# development
h
this is in the context of trying to close out https://github.com/pantsbuild/pants/issues/7969
w
i think we had discussed it being the "first requesting Node". cc @average-vr-56795, @dry-policeman-7927
a
Yeah, I think that sounds about right... I think there was also a discussion of changing how we generate zipkin spans to be more annotation-based and less span-heavy, but that's probably not relevant for you here :)
h
what does it mean to be the first requesting Node, in terms of data structures available in the `run` method of `NodeKey`?
w
Would need to do it above that, in the Graph
In the Graph, you know whether a Node/entry is being "created" vs already existing
..whoops. didn't mean to send to channel
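A toy illustration of the point above (hypothetical names, not the actual pants graph code): only the first requester of a node observes the entry being created, so that requester is the natural choice of parent.

```rust
use std::collections::HashMap;

// Toy model: the first time a node is requested, its entry is created and the
// requesting node is recorded as the parent; later requesters find the entry
// already present and do not change it.
#[derive(Default)]
struct ToyGraph {
    first_requester: HashMap<&'static str, &'static str>,
}

impl ToyGraph {
    fn get(&mut self, src: &'static str, dst: &'static str) {
        self.first_requester.entry(dst).or_insert(src);
    }
}

fn main() {
    let mut g = ToyGraph::default();
    g.get("test_a", "resolve_3rdparty"); // entry created: test_a becomes the parent
    g.get("test_b", "resolve_3rdparty"); // entry already exists: no change
    assert_eq!(g.first_requester["resolve_3rdparty"], "test_a");
}
```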
h
so maybe it doesn't make sense to create workunits in the `run` Node trait implementation at all
w
Possibly. I don't know if we need any information that is only present there
But yea: if the codepath in run doesn't use any local info, it could probably move elsewhere
a
IIRC the code moved to run so that we could get timing information at the start/end of the run, but no reason parent tracking and timing tracking need to be in the same place
h
@average-vr-56795 where was it originally?
w
You can still get start/end time in Graph::get. BUT, i think that maybe one issue might be that Graph is generic. Might need to introduce the workunitstore trait there.
a
Not sure, it may have only existed elsewhere before in review :) shouldn't have bearing on you either way
h
@witty-crayon-22786 @average-vr-56795 is there any reason why we can't re-use the concept of an `EntryId` that's available per-node in `graph/src/entry.rs` for workunit span_id?
i.e. are those actually two different concepts, or just the same concept that happens to be instantiated two separate places right now?
w
The "spanid" concept is one from distributed tracing, where each span may be generated on a different machine without copliding
While it wouldn't collide locally to use the entry id as spanid, it might on the context of remote execution, for example
h
ah that makes sense
so we should still generate them randomly from a large pool
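For concreteness, a minimal sketch of random span-id generation (hypothetical helper name; the real code may size and format ids differently), using the `rand` crate:

```rust
// A random 64-bit value rendered as hex: effectively collision-free both
// locally and across machines, unlike reusing a locally-assigned EntryId.
fn generate_span_id() -> String {
    format!("{:016x}", rand::random::<u64>())
}
```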
as far as I can tell there's no reason why we can't initialize a workunit somewhere around here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L253
and then finalize it with the current time and add it to the workunit store somewhere in here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L403
other than the work of making the `graph` trait aware of `workunit_store`
and then `run` itself wouldn't need to care about workunits
and I think we would need some kind of method on `Graph` to ask of a node, what its parent is
unless this already exists
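One possible shape for making the generic graph crate aware of workunit recording without depending on the concrete store (a hedged sketch with hypothetical names, not an actual pants trait):

```rust
use std::time::Duration;

// Hypothetical: a small trait that the graph's context type could implement,
// so entry.rs can report start/finish information without the generic graph
// crate knowing the concrete WorkunitStore type.
pub trait WorkunitRecorder {
    fn record_workunit(
        &self,
        name: &str,
        duration: Duration,
        parent_span_id: Option<&str>,
    );
}
```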
it is correct to say that a node only has a single parent but multiple children, right? that's the "direction" the graph assumes?
w
No: it's a DAG
So a Node may have multiple parents: hence the question at the head of the thread
h
when conceptually does it happen that a Node has multiple parents?
say an EPR node to be concrete
w
If two tests have the same 3rdparty deps
They'll invoke pex with the same args, which is "the same" epr.
Or if two rules recurse and depend on the same inner node (maybe two libraries depend on the same library)
h
ah yeah, so the first requesting node will be the one that actually invokes `run` on that node
and the second one will see that node cached and won't enter the code path in `entry.rs` where `run` gets called
w
Sort of. The Graph always invokes run in a new async Task. It's more like a trampoline.
a
IIRC if we can do timing measuring in one function, rather than in different functions, that's useful, because the executor may insert arbitrary delays between futuresy functions
h
so maybe the architecture we want is that `Node::run` takes in an `Option<StartedWorkunit>` argument (just like it now accepts a `Context`), that has the parent information from the graph
and then `run` computes the total run time, and adds that to `StartedWorkunit` to make a `WorkUnit`, and that's what gets added to the store?
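A hedged sketch of the shape being proposed (hypothetical types and fields, not the actual pants code): the graph, which knows the parent, constructs the started workunit; `run` finishes it, and the completed workunit is what lands in the store.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the types discussed above.
struct StartedWorkUnit {
    name: String,
    span_id: String,
    parent_span_id: Option<String>,
    start: Instant,
}

struct WorkUnit {
    name: String,
    span_id: String,
    parent_span_id: Option<String>,
    duration: Duration,
}

impl StartedWorkUnit {
    // Finalizing captures the total run time.
    fn finish(self) -> WorkUnit {
        WorkUnit {
            name: self.name,
            span_id: self.span_id,
            parent_span_id: self.parent_span_id,
            duration: self.start.elapsed(),
        }
    }
}

// Sketch of the proposed `run` shape: it receives the already-started workunit
// (if any), does its work, and returns the completed workunit alongside the
// result so the caller can add it to the store.
fn run(started: Option<StartedWorkUnit>) -> (u64, Option<WorkUnit>) {
    let result = 42; // stand-in for the node's real work
    (result, started.map(StartedWorkUnit::finish))
}

fn main() {
    let started = StartedWorkUnit {
        name: "example".to_string(),
        span_id: "abc123".to_string(),
        parent_span_id: None,
        start: Instant::now(),
    };
    let (_result, workunit) = run(Some(started));
    println!("recorded: {:?}", workunit.map(|w| w.duration));
}
```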
w
The total runtime can be computed anywhere
Because it can be attached to the resulting future.
as long as it's in the codepath that created the node.
h
yeah as long as the `run` method returns its start and end time, we can compute that wherever
but it's the start and end real-time of `run` that is the span of time we actually care about tracking, right?
w
No. All of this is async
Recording in Graph::get is the same as recording in Entry::run
h
right now we're recording in Node::run which isn't quite the same as Entry::run
w
Sure, point stands
h
I'm not sure where Graph::get gets called
w
It's how nodes are created
```rust
let start = Instant::now();
doX().map(move |result| { record(start.elapsed()); result })
```
Is how you time async code
But because it's async, recording inside doX is basically equivalent to outside
Because actually calling the method takes ~no time
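A self-contained version of that pattern (using the `futures` 0.3 API purely for illustration; `do_x` is a placeholder for the `doX` above):

```rust
use std::time::Instant;

use futures::FutureExt;

// Placeholder for any async computation the graph kicks off.
async fn do_x() -> u64 {
    42
}

fn main() {
    let start = Instant::now();
    // Recording in a closure attached to the future captures the full
    // wall-clock span of the async work, including time the task spends
    // suspended; attaching this closure inside do_x or outside it makes
    // essentially no difference, which is the point being made above.
    let fut = do_x().map(move |result| {
        println!("result={result}, took {:?}", start.elapsed());
        result
    });
    futures::executor::block_on(fut);
}
```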
h
ah okay yeah, when I said "inside `Node::run`" earlier, I meant within the lexical body of that function; but that function is using `.inspect` to make sure that one of the calculations happens in a future that is run after the future that does the body of the work
so yeah, putting something like that `inspect` future outside of `Node::run` wouldn't make much of a difference
a
What I was saying was that ideally we do it in run, not in a map or inspect, because of potential delay. In an ideal world that delay should be trivial, but in reality it may not be.
h
@average-vr-56795 so what you're saying is that the current implementation that computes the time span in `nodes.rs` might be subtly wrong?
w
@average-vr-56795 unless there is blocking code somewhere, it will be nanoseconds
a
Yes, I am not confident that we have no blocking code :)
h
if we do have blocking code, isn't that part of the time that we're legitimately trying to measure?
w
Yea, arguably.
a
Blocking code in a different Future
w
@average-vr-56795: no, that would not be recorded here
Or, that would not be recorded "falsely" here.
a
The problem is that if some other Future schedules on the executor between the thing we're trying to measure, and the closure where we actually measure it, that blocking time would be added to the timing of this Future
w
It's async code: that's always the case, no matter where you measure
a
Not if it's all within the one function you're actually measuring :)
w
Even then.
Because you are recording the runtime of an async task that will intentionally pause N or more times
h
the Node::run function calls one of several futures within itself
and any actual task might do arbitrary asynchronous stuff
a
How? What causes the single thread's call stack to get pre-empted?
h
which means that the same task could take more or less time on different runs depending on what else was going on with the scheduler at that time
w
@average-vr-56795: there are 400k tasks running: they will suspend to request other work
It's a cooperative scheduler: the tasks pause themselves to wait for other things.
Anyway... IMO, it does not matter from an accuracy perspective whether Graph::get or Node::run. It's more about which allows you to record what you need.
h
it looks like Graph::get gets called from Context::get
and that gets called from the `Node::run` implementation of `Select`
possibly other places too
what's the difference between `graph` and `rule_graph` on `Core`?
i.e. is the "dst_node" in Graph::get the same as the notion of the "first requesting node" we were talking about earlier?
w
src_node is the requesting node
@average-vr-56795: thoughts on whether this whole thing should use Task-locals to pass this information along? Or tokio-tracing directly?
a
Seems like a heavyweight thing to introduce tokio-tracing for, but if we're going to start using it regardless, no strong thoughts either way :)
w
Yea, I'm fairly sure that this has been accomplished using locals in other implementations (certainly in finagle, probably in tokio-tracing).
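For reference, a sketch of what the task-local approach could look like, using present-day tokio's `task_local!` macro purely as an illustration of the idea (the names and the choice of tokio API are assumptions, not the pants implementation):

```rust
use tokio::task_local;

task_local! {
    // Hypothetical: the span id of the workunit that requested the current node.
    static PARENT_SPAN_ID: String;
}

async fn run_node() {
    // Any code running inside the scope below can read the parent span id
    // without it being threaded through every function signature.
    let parent = PARENT_SPAN_ID.with(|id| id.clone());
    println!("parent span: {parent}");
}

#[tokio::main]
async fn main() {
    PARENT_SPAN_ID
        .scope("abc123".to_string(), run_node())
        .await;
}
```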