# development
h
we're trying to debug some performance issues with running PEXs, and this is leading us into questions about graph and process_execution code that I'd like to get @witty-crayon-22786’s and maybe @average-vr-56795’s input on
first question - the "duration" that we report in the v2 UI ultimately comes from here: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L279
which is where we create a future that will be lazy-executed on the NodeContext's runtime
and then return an `EntryState::Running`
for some nodes, particularly EPRs, it seems like there's a lot of future-scheduler overhead between when we create this future and when it actually runs
and that makes me think that maybe counting the real time since we create the `EntryState::Running` and put the current `Instant::now()` into its `start_time` field isn't the right duration to be displaying to users
but I don't know enough about how futures scheduling works to know what happens after that call to `context_factory.spawn`
that duration is also how we compute `heavy_hitters` for the UI, so if that duration isn't helpful for users then maybe the heavy hitters list isn't either
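Roughly, the pattern in question is something like this (a simplified sketch with approximate names, not the actual graph crate types):

use std::time::Instant;

enum EntryState {
    NotStarted,
    Running { start_time: Instant },
    Completed,
}

fn spawn_node(spawn_on_runtime: impl FnOnce()) -> EntryState {
    // e.g. context_factory.spawn(node_future): the future may sit in the
    // executor's queue for a while before it is first polled
    spawn_on_runtime();
    EntryState::Running {
        // the "duration" shown in the UI is measured from this instant,
        // i.e. from spawn time, not from when the work actually starts running
        start_time: Instant::now(),
    }
}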
w
I doubt that *all* of that is overhead, rather than just "contention"
When you're doing hundreds of async things at once, some of them will have to wait for others to complete.
If you'd like to get clean latency data, ensure that you're profiling in the context of an absolutely minimal run.
h
this is more about, what information is helpful to show to pants users running the v2 UI normally
h
I think this is indeed contention
But it's still odd to show users those timings; it makes it look like that's how long the work is actually taking.
w
Ok
h
I think we share the intuition that what the v2 UI actions pane should be indicating is which work items are running right now
👍 1
and that's not what the current instantiation of heavy_hitters is doing, I believe
w
The heavy hitters stuff is pretty naive... if you wanted special handling for processes, you could do that with a new method on the Node trait, for example
h
We're also not sure what, if anything, `process_execution_local_parallelism` is doing in practice.
👍 2
If I set that to 2 or to 1 I get the same local performance.
h
specifically it looks like that option is getting set as the `bound` for `BoundedCommandRunner`, but I'm not sure what that type is meant to do to begin with
w
It's limiting the number of subprocesses we'll spawn at once
h
I mean actual end-to-end real and user time
w
If you only have 2 cores, increasing it a lot beyond 2 won't do much
h
I have 4 cores
so 8 hyperthreads
w
What Greg said
If there were 10 processes "able" to run, the bound would avoid starting all 10 simultaneously
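Conceptually it's just a semaphore sized by that option around process launches; a minimal sketch of the idea (assumed names, not the actual BoundedCommandRunner code):

use std::sync::Arc;
use tokio::sync::Semaphore;

struct BoundedRunner {
    // sized from process_execution_local_parallelism
    permits: Arc<Semaphore>,
}

impl BoundedRunner {
    fn new(bound: usize) -> Self {
        BoundedRunner { permits: Arc::new(Semaphore::new(bound)) }
    }

    async fn run(&self, argv: &[&str]) -> std::io::Result<std::process::ExitStatus> {
        // if `bound` subprocesses are already running, this waits here;
        // the rest of the engine keeps executing other (non-process) work
        let _permit = self.permits.acquire().await.expect("semaphore closed");
        let mut cmd = tokio::process::Command::new(argv[0]);
        cmd.args(&argv[1..]);
        cmd.status().await
    }
}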
h
Hmm, so I'm seeing the same end-to-end user/real/sys time whatever I set that to.
In particular, I'm seeing my user time be about 3.5x my real time, so some parallel execution is happening. More than I asked for in fact!
Let me check if for whatever reason that option is going through
h
so I think that implies that `BoundedCommandRunner` might not be the limiting factor in the use case we're debugging
w
It is only for processes
The rest of the engine is still multithreaded
h
Right
w
And it defaults to "number of cores"... 2 was just an example
h
@witty-crayon-22786 one thing that i don't have a great sense of is how the futures executor that pants uses maps spawned futures to actual threads
h
but the time here is completely dominated by long-running EPRs (requirement pex resolution)
h
my understanding is that the details of this are something that the futures library basically takes care of, but can be fine-tuned
but `NodeContext::spawn` is our own trait, it's not a futures-library thing
w
Yes, we're using the tokio runtime... No, it can't really be fine tuned afaik
Spawn is using tokio spawn underneath.
h
so that's calling stuff in our `task_executor` crate, and that's proxying to `tokio`?
w
"all" of the futures might effectively be executing concurrently
Yes
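Very roughly (simplified, not the actual task_executor API), the proxying amounts to handing the future to a tokio runtime handle:

use std::future::Future;

#[derive(Clone)]
struct Executor {
    handle: tokio::runtime::Handle,
}

impl Executor {
    fn spawn<F>(&self, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        // tokio decides which worker thread polls this and when; the task can
        // sit in the scheduler's queue behind other ready tasks before it runs
        self.handle.spawn(future);
    }
}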
h
so I think this still comes down to the duration between when a future was scheduled and when it actually runs not being a helpful thing to use to decide what to display to users
w
@happy-kitchen-89482: if all of the time is in EPR, then there is your answer... Would tease that apart by adding debug output or additional workunits in the LocalCommandRunner
h
but I'm not sure what is useful
h
So in terms of user experience, we probably only want to display heavy hitters that have actually been spawned (so their number would never exceed process_execution_local_parallelism, which is also probably what the user expects)
Is that do-able?
w
@hundreds-breakfast-49010: probably none of it. If your time is in EPR, add more debug output to LocalProcessRunner
@happy-kitchen-89482: yes, see above re heavy_hitters.
h
we're not just trying to debug our own workflow, we're also trying to think about how to construct a v2 UI output that is useful for end users
w
Could add something to the node trait to "check whether the node should be rendered/considered" or something
But it doesn't sound like this is a matter of rendering... It sounds like you have profiling to do.
h
sure, but that also means that the v2 UI output is misleading, for us as well as end users
w
Understood... But scheduling is measured in nanoseconds. So if you are seeing delays that make it look like processes are taking a long time, the most likely thing is that... They are taking a long time
("the most likely"... not certain, but you should absolutely rule it out first before going about changing rendering, imo)
h
creating the future is quick, but the delay between when that future is created and when it actually runs might not be, depending on what else the engine is doing
and that time is getting counted as time it takes for a graph node to run
both in terms of what the UI chooses to display and also what literal number of seconds we print in the UI
and that's not the right thing to do, I think, regardless of whether or not we also have some slow EPRs we need to debug
w
Sure. Priorities is all I'm saying.
h
@witty-crayon-22786 re "they are taking a long time" - the processes in question take ~60 seconds each, when we run on a single target. When we run on 10 targets we see a total runtime of somewhat under 600 seconds, so it looks like we're not getting much parallelism benefit. Could be contention, could be something else. That remains to debug. But putting that to one side, what we display to the user is that one process took 60 seconds, one took 120 seconds, and so on (the numbers aren't so precise, but you get the idea). So this is bad UX IMO.
What I think the user will expect is to only see work that has been spawned (processes they could see via `ps`), and the timings from when they were spawned.
w
ah, got it. sure.
h
Does that make sense?
w
yep.
👍 1
see above re how to do that... changes to the Node trait that would be consumed in heavy_hitters.
that code is the only consumer of the Node runtimes, so if you don't need them anymore, can remove them.
h
@witty-crayon-22786 does it make sense to rewrite the heavy_hitters code to only start counting time from when this `run` call happens: https://github.com/pantsbuild/pants/blob/master/src/rust/engine/graph/src/entry.rs#L265
this makes sense to me given my mental model of what the engine is doing
I think that would amount to changing how `current_running_duration` works,
and I think that's only used to generate the heavy hitters list anyway
h
Changing that UX would also make the performance issue easier to debug, as we could see at a glance what is truly running.
Can do that in other ways today, but that would be super convenient
w
@hundreds-breakfast-49010: no... it's still async code
moving the location of the timer within the async code won't significantly affect the accuracy, because a node can suspend for a while at any time.
🤔 1
h
hm yeah
hm, we have some infrastructure that tries to track what time a node is actually doing work with the workunits infrastructure
we might want to also use that to decide what work items to display in the UI
w
one approach would be to remove timing for any nodes that are truly async
and then only render timing on nodes that are synchronous, like process executions
@hundreds-breakfast-49010: you'll see the same timespans in workunits, afaik.
h
those are the things that are likely to take a long time in any case
what was this UI originally designed to show?
maybe we should scrap the notion of trying to show the exec time for any node, and only show process executions in that section of the v2 UI?
👍 1
w
that's roughly what i was suggesting, yea.
i don't think that you would skip rendering other types of nodes... you just wouldn't show timing for it.
r
I agree with Stu that the interface shouldn't be used to gauge performance bottlenecks, and I see its primary goal to be "make sure the user sees that we're doing something". Ideally, the output should change fast enough that the user can't actually read it for a lot of operations.
However, if this is a priority for you folks, the place where we have decided that we'll start recording real duration time is in the implementations of `WrappedNode::run`. In particular, for EPRs we use the v2 notion of `workunits` for that. While I don't think we can use workunits for this (because they are recorded at the end of an EPR's execution), we can probably implement something similar to that for heavy_hitters. This could be as easy as keeping a `Map<UserDisplayNameForNode, RealStartTime>` in the `Context`, or maybe `Session`:
- Nodes would optionally add themselves to that map as part of executing the Future returned by the `run` method.
- We would only render things from that map.
- At the end of the execution of a node, we could make it remove itself from the map.
Alternatively, we could create another method in the trait `WrappedNode`, something like `run_impl`, and rename all the current implementations of `run` to `run_impl`. Then, we can implement `run` as a method in the trait that will do all the map finagling for us, something like (totally bogus syntax):
fn run(self, context: Context) -> NodeFuture<Value> {
  future::lazy(move || {
    // register the start time now in the context, if this node has a user_readable_name
    Ok(())
  })
  .and_then(move |_| self.run_impl(context)) // run the renamed implementation
  .inspect(|_| {
    // de-register this node from the map
  })
}
This way, we leave `run` unchanged, and we centralize this recording of times in one particular function.
h
(Thank you @red-balloon-89377 for taking the time to write this all out!)
👍 2
w
@red-balloon-89377: that runs afoul of what i was mentioning to Greg above though: moving the tracking within the async code won't really affect this
👍 1
you'd need to move it to within the lock acquisition of running a local process at the very least
i'm not sure that you can do this generically for all nodes, which is why i suggested adding a method to the Node trait that allows you to compute runtime in a node-specific way
👍 1
r
Hmmm, that's fair. I guess with EPRs specifically, each relevant `CommandRunner` does have to decide when something is started.
Then I have no more information to give 🙂
h
one issue that occurs to me
right now we choose the number of heavy hitters to display in the UI based on the number of local CPU cores
but the number of heavy_hitters isn't really directly related to the number of CPU cores. The UI code just happens to use that number to decide how many rows on screen to render, and chooses that many long-scheduled nodes to display
if the primary goal is to "make sure the user sees something", then why should that be related to the number of CPU cores?
r
afaik that was just an arbitrary heuristic. In an ideal world, we would display exactly what each CPU was running at that time. I don’t really think that’s practical though.
a
I haven’t read through this whole thread - should I, or did this get sufficiently resolved? 🙂
h
I haven’t read through this whole thread - should I, or did this get sufficiently resolved?
@average-vr-56795 If you have the time to spare, it would help to get your input. tl;dr we've been finding that resolving the requirements Pex is really slow in V2. One of the problems is that we believe `--v2-ui` reports timing in a way that doesn't exactly reflect reality and makes the user believe things are slower than they really are. We're trying to think of a way to improve the timing output for `--v2-ui`
a
Cool, I’ll catch up after reviewing Greg’s PR 🙂
❤️ 2
Yeah, I can definitely get behind the adding-an-extensible-way-to-report-start-time thing described upthread. A cute way of doing this could be to have an `elapsed_time` method on `Node`, have that method consult an optional task-local for "time-actually-started" to override the field, and have the command runners set that task local
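Something like this, perhaps (a rough sketch with made-up names, assuming a tokio-style task-local):

use std::cell::Cell;
use std::time::{Duration, Instant};

tokio::task_local! {
    // optional "time actually started", set once the subprocess really begins
    static ACTUAL_START: Cell<Option<Instant>>;
}

// a command runner would call this just before launching the process
fn mark_actually_started() {
    let _ = ACTUAL_START.try_with(|start| start.set(Some(Instant::now())));
}

// an elapsed_time-style helper: prefer the real start time over the time the
// node future was created, falling back to the latter if nothing was recorded
fn elapsed_time(created_at: Instant) -> Duration {
    ACTUAL_START
        .try_with(|start| start.get())
        .ok()
        .flatten()
        .unwrap_or(created_at)
        .elapsed()
}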
💯 2
w
Yea, definitely.
if the primary goal is to "make sure the user sees something", then why should that be related to the number of CPU cores?
@hundreds-breakfast-49010: no particular reason... But you can think of it a bit like all of the async code existing in service of actually starting long-running processes
To the extent that a long-running process isn't taking a slot, it will be because either: 1) there isn't one ready to run, or 2) some async code is running that will probably result in a process that will fill the display slot
But it's totally the case that you could be running N processes filling all the display slots (naturally, by virtue of them taking a long time to run), while a bunch of other async stuff is happening in the background to prepare more processes
One other thing: the number of slots could be something like "command runner potential concurrency" rather than "num cores", because in particular, remote execution allows for more than the local core count for sure.
But yea: processes.
h
@average-vr-56795 I think that some of the code around Workunits already does this or at least attempts to do this correctly
certainly there we care about recording the actual start time and actual end time of a chunk of work
and it makes sense to me that the v2 UI should be using that same notion of what time an actual chunk of work started
a
Yeah, that all sounds reasonable 🙂
As far as I know, processes (and possibly filesystem operations) should be the only things with noticeably different numbers there, because we actively throttle them
👍 2