# general
s
I am looking for some help on a performance issue I am having with pants 2.0.0. I have what I believe is a unique setup with one of our application servers that we build and run with pants. This specific server spawns other pants processes from within itself. With pants v1 we were able to run multiple `pants run` commands from within a single `pants run` process without issue. With the v2 engine we are CPU and memory bound, and the native OS will start killing random processes (as expected) when we reach the ceiling. A little background on how this server works:
```
application server: main application process running with ./pants run
child server: Receives a request to start a "task" or series of "tasks" = multiple ./pants run sub-processes executed in parallel. These sub-processes then exit gracefully after they are completed.
```
The way this works, we can receive sub-process commands to execute dozens of `pants run` “task” commands at a single time. Some things I have tried that do not have an effect on performance:
1. pantsd on or off
2. Concurrency limits set in the global pants options
I am looking for some guidance here, as we have tried everything that comes to mind without changing how this application runs. (That is out of our scope at this time, as the main concern here is getting to pants 2.0.0.) We are currently on pants 1.30.
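(For context, a rough sketch of the shape being described — the target addresses are hypothetical and the flag value is illustrative; `--no-pantsd` and `--process-execution-local-parallelism` are real global options:)
```bash
# Outer process: the application server itself runs under pants.
./pants run server:main            # hypothetical target

# Each incoming "task" fans out to its own full pants instance:
./pants run tasks:foo &            # hypothetical targets, launched in parallel
./pants run tasks:bar &
wait

# Knobs tried without effect:
./pants --no-pantsd run tasks:foo                                # pantsd off
./pants --process-execution-local-parallelism=2 run tasks:foo    # concurrency limit
```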
w
interesting
so the constraint is basically “how much total memory does each pants instance use”?
s
that and CPU. with pantsd off our execution of tasks goes further … but still hits the ceiling
w
yea.
so… two things i can think of:
1) `./pants run` should be roughly equivalent to `./pants package && dist/app.pex` … but the latter exits before running the application, which would free resources sooner
2) we’re getting much, much closer to https://github.com/pantsbuild/pants/issues/7654 , which would allow multiple concurrent runs with a single instance of pantsd, which would significantly reduce memory/cpu usage here since they’d share work. but it is very unlikely to be backported to `2.0.x`… it will likely first be available in `2.2.x`
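(A sketch of the substitution suggested in point 1 — the target address and `dist/` path are illustrative, since the actual output path depends on the `pex_binary` target:)
```bash
# Keep-alive approach: a pants process stays resident for the task's lifetime.
./pants run tasks:bin

# Build-then-exec approach: pants builds the PEX and exits, releasing its
# CPU/memory before the long-running task starts.
./pants package tasks:bin          # hypothetical pex_binary target
dist/tasks/bin.pex                 # illustrative output path
```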
s
few comments:
1. we tried this but still used the old binary goal - same issue: because so many `./pants binary` commands are running in parallel, it hits the ceiling. Is there a difference in the package goal vs binary other than the syntax?
2. This would maybe help with things… Do you have a timeline on this so we could test it out?
I am curious why we are seeing such a performance degradation with the v2 engine vs v1 when executing this same flow.
w
> Is there a difference in the package goal vs binary other than the syntax?
no, different name for the same thing.
> I am curious why we are seeing such a performance degradation with the v2 engine vs v1 when executing this same flow.
v1 was not parallel at all… so adding parallelism around it was probably reasonable. v2 is parallel, so you should be able to run `./pants package ::` (all the things) to build all binaries in parallel, but wrapping more parallelism around it is going to over-saturate
👍 1
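(Concretely, the suggested inversion looks roughly like this — build once with pants’ internal parallelism, then execute the prebuilt artifacts with no per-task pants instance; addresses and paths are illustrative:)
```bash
# One pants invocation, internally parallel, builds every packageable target:
./pants package ::

# Tasks then execute the prebuilt artifacts directly:
dist/tasks/foo.pex &               # illustrative output paths
dist/tasks/bar.pex &
wait
```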
but, one potential tack: `--process-execution-local-parallelism` only controls the parallelism of processes that we fork: it doesn’t control the number of threads that we’ll use to run `@rules`
👀 1
that would primarily affect CPU usage, but it might improve memory slightly
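(To make the distinction concrete, a hedged sketch — the value shown is illustrative:)
```bash
# Bounds only the subprocesses pants forks (compilers, test runners, ...):
./pants --process-execution-local-parallelism=4 package ::

# The thread pool that executes @rules was sized automatically at this point;
# the patch discussed below is what makes it configurable.
```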
s
yeah, being able to control the number of threads for `@rules` may help here. CPU is our main concern at the moment, as it's the reason our host OS is shooting PIDs. The way this server is architected, it acts like a giant gateway, so to speak. It waits for a “task server” to be kicked off and then launches a bunch of mini servers that mock out the flow of an AWS step function in our local and CI environments. We know we could optimize here for improvements on our end, but at this time our goal is just to get to pants 2.
Do you have any suggestions on things we can do (presently) to help mitigate this issue?
w
> Do you have a timeline on this so we could test it out?
in the next month, most likely? if it’s your only blocking issue, we could look into bumping it up a bit.
s
we are 100% blocked on implementing pants 2 because of this issue. This should be our last hurdle
w
ok, thanks.
> CPU is our main concern at the moment, as it's the reason our host OS is shooting PIDs.
interesting. are you sure…? i didn’t realize that that was a thing!
if you’re pretty sure, i could get a patch out to expose setting thread counts today, which you could experiment with.
grabbing lunch, back in a bit.
e
Agreed with Stu - CPU overusage killing would be new to me; the PID shooting sounds like the OOMKiller.
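(A generic way to check for OOMKiller evidence, not something prescribed in the thread:)
```bash
# The OOM killer logs to the kernel ring buffer / journal:
dmesg -T | grep -iE 'out of memory|killed process'
journalctl -k | grep -i oom
```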
s
We were running `top` on our linux boxes, watching this application run while our tests ran. We observed the pants processes taking CPU to the ceiling, while memory was pretty contained. I could verify again to be 1000% sure.
@witty-crayon-22786 if you could patch the thread counts that would be 🔥 and we could play with them today and see if that helps.
w
s
i'll do this now
w
(i’ve been working up a patch to make this configurable, but it’s taken longer than i expected because it’s used so early during startup. will have it out today.)
s
ok, thank you. i've been trying some other internal things to see if i can make improvements. all to no avail so far…
w
which release of pants are you targeting? i can backport this to 2.0.x or 2.1.x if need be, but it might be worth having you run a dev version to validate it before we go ahead on getting it into a release
s
i'm running `2.0.1rc4`
i have to be there, as i haven't fixed things for deprecated options in 2.1 yet
w
got it. yea, reasonable.
s
i can run a dev version but it has to be 2.0.x
w
yep, i can arrange that.
👍 1
s
perfect, i appreciate the help here
w
do you have access to s3 from the place you’re testing?
@salmon-barista-63163: ^
s
yeah, it's in CircleCI, so i can pull from s3, yes
w
great. yea, that makes it easier.
👍 1
will kick off a build shortly, and then you can use the PANTS_SHA support in more recent versions of the `pants` script to try out the pre-release build: https://github.com/pantsbuild/setup/blob/69351495867bd555a76b4a523f816b66acb8506f/pants#L15-L18
ok, as soon as https://travis-ci.com/github/pantsbuild/pants/builds/209249097 goes green (might be about 60 minutes), you can use `PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0`. it’s branched from `2.0.1rc4`
logging off for the evening: good luck. can look further tomorrow.
(peeked back in to see how this was doing, and it needed an update. edited the above)
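(For reference, `PANTS_SHA` is read by the `./pants` bootstrap script linked above; usage would look roughly like this, with the target name hypothetical:)
```bash
# One-off: bootstrap and run a pre-release pants built from that commit.
PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0 ./pants --version

# Or export it for a whole testing session:
export PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0
./pants run tasks:bin              # hypothetical target
```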
s
👍 i'm going to try this now. Thanks for doing the work to get that out to me. I'll have an update shortly
what var can i set in the global config to gain access to the threading?
w
It looks like if you set it low enough you can deadlock... so would probably not go below 2.
👍 1
let me know if this works out for you. i think we’ll land it regardless, but i might not cherry-pick it unless you confirm that it addresses your issue.
> It looks like if you set it low enough you can deadlock... so would probably not go below 2.
i’m adding a defense against this to the final patch.
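(The final option names aren't stated in this thread; later pants releases expose the rule-engine pool via the global `rule_threads_core`/`rule_threads_max` options, so the experiment presumably looked something like this — treat the names and values as an assumption, not something confirmed here:)
```bash
# Assumed option names (from later pants releases; not confirmed in this
# thread). Keep the core thread count >= 2 to avoid the deadlock above.
PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0 \
  ./pants --rule-threads-core=2 --rule-threads-max=4 run tasks:bin
```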
s
nice call on the defense. This looks to have semi-solved my issue… this, in combo with some modifications to our code, seems to have worked.
looks like the limit on an 8-CPU, 16 GB memory machine is about 10 pants processes at once. any more in parallel and it hits the ceiling
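(One generic way to enforce that ceiling from the task-server side — a sketch, not something prescribed in the thread; the target list is illustrative:)
```bash
# Cap concurrent pants invocations at the observed safe limit of 10:
printf '%s\n' tasks:foo tasks:bar tasks:baz |   # illustrative target list
  xargs -P 10 -I{} ./pants run {}
```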
e
Excellent. Did you get a chance to grep for OOMKiller evidence or lack thereof prior to the fix?
s
it was CPU bound. OOM killer was not killing the processes.
w
Cool beans. So I assume you need this backported to both 2.0.x and 2.1.x then, huh
s
yes please. We are going to finalize our 2.0.x migration today. Then we are going to move to 2.1.x in the coming weeks.
e
> it was CPU bound. OOM killer was not killing the processes.
Huh, ok - thanks. At some point I'm sure we'll need to learn what subsystem or configuration implements this CPUKiller - we're bound to see it again.
w
@salmon-barista-63163: ok: if you’re unblocked using that branch, i will likely wait until after we’ve cut the current final releases of 2.0.1 and 2.1.1 (tomorrow morning: they’ve been queued for a really long time now) to backport this
s
@witty-crayon-22786 that is fine. I am unblocked for now
@enough-analyst-54434 @witty-crayon-22786 I agree here. I am more than happy to get on a call with whoever and show this happening in real time, maybe do a little debugging. just let me know