# development
h
we're trying to debug some performance issues with running PEXs, and this is leading us into questions about graph and process_execution code that I'd like to get @witty-crayon-22786’s and maybe @average-vr-56795’s input on
first question - the "duration" that we report in the v2 UI ultimately comes from here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L279
which is where we create a future that will be lazy-executed on the NodeContext's runtime
and then return an `EntryState::Running`
for some nodes, particularly EPRs, it seems like there's a lot of future-scheduler overhead between when we create this future and when it actually runs
and that makes me think that maybe counting the real time since we create the `EntryState::Running` and put the current `Instant::now()` into its `start_time` field isn't the right duration to be displaying to users
but I don't know enough about how futures scheduling works to know what happens after that call to `context_factory.spawn`
that duration is also how we compute `heavy_hitters` for the UI, so if that duration isn't helpful for users then maybe the heavy hitters list isn't either
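Roughly, the pattern in question is something like this (a simplified sketch with approximate names, not the actual graph crate types):

use std::time::Instant;

enum EntryState {
    NotStarted,
    Running { start_time: Instant },
    Completed,
}

fn spawn_node(spawn_on_runtime: impl FnOnce()) -> EntryState {
    // e.g. context_factory.spawn(node_future): the future may sit in the
    // executor's queue for a while before it is first polled
    spawn_on_runtime();
    EntryState::Running {
        // the "duration" shown in the UI is measured from this instant,
        // i.e. from spawn time, not from when the work actually starts running
        start_time: Instant::now(),
    }
}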
w
I doubt that *all* of that is overhead, rather than just "contention"
When you're doing hundreds of async things at once, some of them will have to wait for others to complete.
If you'd like to get clean latency data, ensure that you're profiling in the context of an absolutely minimal run.
h
this is more about, what information is helpful to show to pants users running the v2 UI normally
h
I think this is indeed contention
But it's still odd to show users those timings; it makes it look like that's how long the work is actually taking.
w
Ok
h
I think we share the intuition that what the v2 UI actions pane should be indicating is which work items are running right now
👍 1
and that's not what the current instantiation of heavy_hitters is doing, I believe
w
The heavy hitters stuff is pretty naive... if you wanted special handling for processes, you could do that with a new method on the Node trait, for example
h
We're also not sure what, if anything, `process_execution_local_parallelism` is doing in practice.
👍 2
If I set that to 2 or to 1 I get the same local performance.
h
specifically it looks like that option is getting set as the `bound` for `BoundedCommandRunner`, but I'm not sure what that type is meant to do to begin with
w
It's limiting the number of subprocesses we'll spawn at once
h
I mean actual end-to-end real and user time
w
If you only have 2 cores, increasing it a lot beyond 2 won't do much
h
I have 4 cores
so 8 hyperthreads
w
What Greg said
If there were 10 processes "able" to run, the bound would avoid starting all 10 simultaneously
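Conceptually it's just a semaphore sized by that option around process launches; a minimal sketch of the idea (assumed names, not the actual BoundedCommandRunner code):

use std::sync::Arc;
use tokio::sync::Semaphore;

struct BoundedRunner {
    // sized from process_execution_local_parallelism
    permits: Arc<Semaphore>,
}

impl BoundedRunner {
    fn new(bound: usize) -> Self {
        BoundedRunner { permits: Arc::new(Semaphore::new(bound)) }
    }

    async fn run(&self, argv: &[&str]) -> std::io::Result<std::process::ExitStatus> {
        // if `bound` subprocesses are already running, this waits here;
        // the rest of the engine keeps executing other (non-process) work
        let _permit = self.permits.acquire().await.expect("semaphore closed");
        let mut cmd = tokio::process::Command::new(argv[0]);
        cmd.args(&argv[1..]);
        cmd.status().await
    }
}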
h
Hmm, so I'm seeing the same end-to-end user/real/sys time whatever I set that to.
In particular, I'm seeing my user time be about 3.5x my real time, so some parallel execution is happening. More than I asked for in fact!
Let me check if for whatever reason that option is going through
h
so I think that implies that `BoundedCommandRunner` might not be the limiting factor in the use case we're debugging
w
It is only for processes
The rest of the engine is still multithreaded
h
Right
w
And it defaults to "number of cores"... 2 was just an example
h
@witty-crayon-22786 one thing that i don't have a great sense of is how the futures executor that pants uses maps spawned futures to actual threads
h
but the time here is completely dominated by long-running EPRs (requirement pex resolution)
h
my understanding is that the details of this are something that the futures library basically takes care of, but can be fine-tuned
but `NodeContext::spawn` is our own trait, it's not a futures-library thing
w
Yes, we're using the tokio runtime... No, it can't really be fine tuned afaik
Spawn is using tokio spawn underneath.
h
so that's calling stuff in our `task_executor` crate, and that's proxying to `tokio`?
w
"all" of the futures might effectively be executing concurrently
Yes
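Very roughly (simplified, not the actual task_executor API), the proxying amounts to handing the future to a tokio runtime handle:

use std::future::Future;

#[derive(Clone)]
struct Executor {
    handle: tokio::runtime::Handle,
}

impl Executor {
    fn spawn<F>(&self, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        // tokio decides which worker thread polls this and when; the task can
        // sit in the scheduler's queue behind other ready tasks before it runs
        self.handle.spawn(future);
    }
}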
h
so I think this still comes down to the duration between when a future was scheduled and when it actually runs not being a helpful thing to use to decide what to display to users
w
@happy-kitchen-89482: if all of the time is in EPR, then there is your answer... Would tease that apart by adding debug output or additional workunits in the LocalCommandRunner
h
but I'm not sure what is useful
h
So in terms of user experience, we probably only want to display heavy hitters that have actually been spawned (so their number would never exceed process_execution_local_parallelism, which is also probably what the user expects)
Is that do-able?
w
@hundreds-breakfast-49010: probably none of it. If your time is in EPR, add more debug output to LocalProcessRunner
@happy-kitchen-89482: yes, see above re heavy_hitters.
h
we're not just trying to debug our own workflow, we're also trying to think about how to construct a v2 UI output that is useful for end users
w
Could add something to the node trait to "check whether the node should be rendered/considered" or something
But it doesn't sound like this is a matter of rendering... It sounds like you have profiling to do.
h
sure, but that also means that the v2 UI output is misleading, for us as well as end users
w
Understood... But scheduling is measured in nanoseconds. So if you are seeing delays that make it look like processes are taking a long time, the most likely thing is that... They are taking a long time
("the most likely"... not certain, but you should absolutely rule it out first before going about changing rendering, imo)
h
creating the future is quick, but the delay between when that future is created and when it actually runs might not be, depending on what else the engine is doing
and that time is getting counted as time it takes for a graph node to run
both in terms of what the UI chooses to display and also what literal number of seconds we print in the UI
and that's not the right thing to do, I think, regardless of whether or not we also have some slow EPRs we need to debug
w
Sure. Priorities is all I'm saying.
h
@witty-crayon-22786 re "they are taking a long time" - the processes in question take ~60 seconds each, when we run on a single target. When we run on 10 targets we see a total runtime of somewhat under 600 seconds, so it looks like we're not getting much parallelism benefit. Could be contention, could be something else. That remains to debug. But putting that to one side, what we display to the user is that one process took 60 seconds, one took 120 seconds, and so on (the numbers aren't so precise, but you get the idea). So this is bad UX IMO.
What I think the user will expect is to only see work that has been spawned (processes they could see via `ps`), and the timings from when they were spawned.
w
ah, got it. sure.
h
Does that make sense?
w
yep.
👍 1
see above re how to do that... changes to the Node trait that would be consumed in heavy_hitters.
that code is the only consumer of the Node runtimes, so if you don't need them anymore, can remove them.
h
@witty-crayon-22786 does it make sense to rewrite the heavy_hitters code to only start counting time from when this `run` call happens: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L265
this makes sense to me given my mental model of what the engine is doing
I think that would amount to changing how `current_running_duration` works,
and I think that's only used to generate the heavy hitters list anyway
h
Changing that UX would also make the performance issue easier to debug, as we could see at a glance what is truly running.
Can do that in other ways today, but that would be super convenient
w
@hundreds-breakfast-49010: no... it's still async code
moving the location of the timer within the async code won't significantly affect the accuracy, because a node can suspend for a while at any time.
🤔 1
h
hm yeah
hm, we have some infrastructure that tries to track what time a node is actually doing work with the workunits infrastructure
we might want to also use that to decide what work items to display in the UI
w
one approach would be to remove timing for any nodes that are truly async
and then only render timing on nodes that are synchronous, like process executions
@hundreds-breakfast-49010: you'll see the same timespans in workunits, afaik.
h
those are the things that are likely to take a long time in any case
what was this UI originally designed to show?
maybe we should scrap the notion of trying to show the exec time for any node, and only show process executions in that section of the v2 UI?
👍 1
w
that's roughly what i was suggesting, yea.
i don't think that you would skip rendering other types of nodes... you just wouldn't show timing for it.
r
I agree with Stu that the interface shouldn't be used to gauge performance bottlenecks, and I see its primary goal to be "make sure the user sees that we're doing something". Ideally, the output should change fast enough that the user can't actually read it for a lot of operations.
However, if this is a priority for you folks, the place where we have decided that we'll start recording real duration time is in the implementations of `WrappedNode::run`. In particular, for EPRs we use the v2 notion of `workunits` for that. While I don't think we can use workunits for this (because they are recorded at the end of an EPR's execution), we can probably implement something similar to that for heavy_hitters. This could be as easy as keeping a `Map<UserDisplayNameForNode, RealStartTime>` in the `Context`, or maybe `Session`:
- Nodes would optionally add themselves to that map as part of executing the Future returned by the `run` method.
- We would only render things from that map.
- At the end of the execution of a node, we could make it remove itself from the map.
Alternatively, we could create another method in the trait `WrappedNode`, something like `run_impl`, and rename all the current implementations of `run` to `run_impl`. Then, we can implement `run` as a method in the trait that will do all the map finagling for us, something like (totally bogus syntax):
fn run(self, context: Context) -> NodeFuture<Value> {
  future::lazy(move || {
    // register the start time now in the context, if this node has a user_readable_name
    Ok(())
  })
  .and_then(move |_| self.run_impl(context)) // run the renamed implementation
  .inspect(|_| {
    // de-register this node from the map
  })
}
This way, we leave `run` unchanged, and we centralize this recording of times in one particular function.
h
(Thank you @red-balloon-89377 for taking the time to write this all out!)
👍 2
w
@red-balloon-89377: that runs afoul of what i was mentioning to Greg above though: moving the tracking within the async code won't really affect this
👍 1
you'd need to move it to within the lock acquisition of running a local process at the very least
i'm not sure that you can do this generically for all nodes, which is why i suggested adding a method to the Node trait that allows you to compute runtime in a node-specific way
👍 1
r
Hmmm, that's fair. I guess with EPRs specifically, each relevant `CommandRunner` does have to decide when something is started.
Then I have no more information to give 🙂
h
one issue that occurs to me
right now we choose the number of heavy hitters to display in the UI based on the number of local CPU cores
but the number of heavy_hitters isn't really directly related to the number of CPU cores. The UI code just happens to use that number to decide how many rows on screen to render, and chooses that many long-scheduled nodes to display
if the primary goal is to "make sure the user sees something", then why should that be related to the number of CPU cores?
r
afaik that was just an arbitrary heuristic. In an ideal world, we would display exactly what each CPU was running at that time. I don’t really think that’s practical though.
a
I haven’t read through this whole thread - should I, or did this get sufficiently resolved? 🙂
h
I haven’t read through this whole thread - should I, or did this get sufficiently resolved?
@average-vr-56795 If you have the time to spare, it would help to get your input. tl;dr we've been finding that resolving the requirements Pex is really slow in V2. One of the problems is that we believe `--v2-ui` reports timing in a way that doesn't exactly reflect reality and makes the user believe things are slower than they really are. We're trying to think of a way to improve the timing output for `--v2-ui`
a
Cool, I’ll catch up after reviewing Greg’s PR 🙂
❤️ 2
Yeah, I can definitely get behind the adding-an-extensible-way-to-report-start-time thing described upthread. A cute way of doing this could be to have an `elapsed_time` method on `Node`, have that method consult an optional task-local for "time-actually-started" to override the field, and have the command runners set that task local
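Something like this, perhaps (a rough sketch with made-up names, assuming a tokio-style task-local):

use std::cell::Cell;
use std::time::{Duration, Instant};

tokio::task_local! {
    // optional "time actually started", set once the subprocess really begins
    static ACTUAL_START: Cell<Option<Instant>>;
}

// a command runner would call this just before launching the process
fn mark_actually_started() {
    let _ = ACTUAL_START.try_with(|start| start.set(Some(Instant::now())));
}

// an elapsed_time-style helper: prefer the real start time over the time the
// node future was created, falling back to the latter if nothing was recorded
fn elapsed_time(created_at: Instant) -> Duration {
    ACTUAL_START
        .try_with(|start| start.get())
        .ok()
        .flatten()
        .unwrap_or(created_at)
        .elapsed()
}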
💯 2
w
Yea, definitely.
if the primary goal is to "make sure the user sees something", then why should that be related to the number of CPU cores?
@hundreds-breakfast-49010: no particular reason... But you can think of it a bit like all of the async code existing in service of actually starting long-running processes
To the extent that a long-running process isn't taking a slot, it will be because either: 1) there isn't one ready to run, or 2) some async code is running that will probably result in a process that will fill the display slot
But it's totally the case that you could be running N processes filling all the display slots (naturally, by virtue of them taking a long time to run), while a bunch of other async stuff is happening in the background to prepare more processes
One other thing: the number of slots could be something like "command runner potential concurrency" rather than "num cores", because in particular, remote execution allows for more than the local core count for sure.
But yea: processes.
h
@average-vr-56795 I think that some of the code around Workunits already does this or at least attempts to do this correctly
certainly there we care about recording the actual start time and actual end time of a chunk of work
and it makes sense to me that the v2 UI should be using that same notion of what time an actual chunk of work started
a
Yeah, that all sounds reasonable 🙂
As far as I know, processes (and possibly filesystem operations) should be the only things with noticeably different numbers there, because we actively throttle them
👍 2