# general
s
I am looking for some help on a performance issue I am having with pants 2.0.0. I have what I believe is a unique setup with one of our application servers that we build and run with pants. This specific server spawns other pants processes from within itself. With pants v1 we were able to run multiple `pants run` commands from within a single `pants run` process without issue. With the v2 engine we are CPU and memory bound, and the native OS will start killing random processes (as expected) when we reach the ceiling. A little background on how this server works:
```
application server: main application process running with ./pants run
child server: Receives a request to start a "task" or series of "tasks" = multiple ./pants run sub-processes executed in parallel. These sub-processes then exit gracefully after they are completed.
```
The way this works, we can receive sub-process commands to execute dozens of `pants run` “task” commands at a single time. Some things I have tried that do not have an effect on performance:
1. pantsd on or off
2. Concurrency limits set in the global pants options
I am looking for some guidance here, as we have tried everything that comes to mind without changing how this application runs. (That is out of our scope at this time, as the main concern here is getting to pants 2.0.0.) We are currently on pants 1.30.
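(For context, a rough sketch of the shape being described — the target addresses are hypothetical and the flag value is illustrative; `--no-pantsd` and `--process-execution-local-parallelism` are real global options:)
```bash
# Outer process: the application server itself runs under pants.
./pants run server:main            # hypothetical target

# Each incoming "task" fans out to its own full pants instance:
./pants run tasks:foo &            # hypothetical targets, launched in parallel
./pants run tasks:bar &
wait

# Knobs tried without effect:
./pants --no-pantsd run tasks:foo                                # pantsd off
./pants --process-execution-local-parallelism=2 run tasks:foo    # concurrency limit
```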
w
interesting
so the constraint is basically “how much total memory does each pants instance use”?
s
that and CPU. with pantsd off our execution of tasks goes further … but still hits the ceiling
w
yea.
so… two things i can think of:
1) `./pants run` should be roughly equivalent to `./pants package && dist/app.pex` … but the latter exits before running the application, which would free resources sooner
2) we’re getting much, much closer to https://github.com/pantsbuild/pants/issues/7654 , which would allow multiple concurrent runs with a single instance of pantsd, which would significantly reduce memory/cpu usage here since they’d share work. but it is very unlikely to be backported to `2.0.x`… it will likely first be available in `2.2.x`
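(A sketch of the substitution suggested in point 1 — the target address and `dist/` path are illustrative, since the actual output path depends on the `pex_binary` target:)
```bash
# Keep-alive approach: a pants process stays resident for the task's lifetime.
./pants run tasks:bin

# Build-then-exec approach: pants builds the PEX and exits, releasing its
# CPU/memory before the long-running task starts.
./pants package tasks:bin          # hypothetical pex_binary target
dist/tasks/bin.pex                 # illustrative output path
```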
s
few comments:
1. we tried this but still used the old binary goal - same issue: because so many `./pants binary` commands are running in parallel, it hits the ceiling. Is there a difference in the package goal vs binary other than the syntax?
2. This would maybe help with things… Do you have a timeline on this so we could test it out?
I am curious why we are seeing such a performance degradation with the v2 engine vs v1 when executing this same flow.
w
> Is there a difference in the package goal vs binary other than the syntax?
no, different name for the same thing.
> I am curious why we are seeing such a performance degradation with the v2 engine vs v1 when executing this same flow.
v1 was not parallel at all… so adding parallelism around it was probably reasonable. v2 is parallel, so you should be able to run `./pants package ::` (all the things) to build all binaries in parallel, but wrapping more parallelism around it is going to over-saturate
👍 1
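(Concretely, the suggested inversion looks roughly like this — build once with pants’ internal parallelism, then execute the prebuilt artifacts with no per-task pants instance; addresses and paths are illustrative:)
```bash
# One pants invocation, internally parallel, builds every packageable target:
./pants package ::

# Tasks then execute the prebuilt artifacts directly:
dist/tasks/foo.pex &               # illustrative output paths
dist/tasks/bar.pex &
wait
```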
but, one potential tack: `--process-execution-local-parallelism` only controls the parallelism of processes that we fork: it doesn’t control the number of threads that we’ll use to run `@rules`
👀 1
that would primarily affect CPU usage, but it might improve memory slightly
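(To make the distinction concrete, a hedged sketch — the value shown is illustrative:)
```bash
# Bounds only the subprocesses pants forks (compilers, test runners, ...):
./pants --process-execution-local-parallelism=4 package ::

# The thread pool that executes @rules was sized automatically at this point;
# the patch discussed below is what makes it configurable.
```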
s
yeah, being able to control the number of threads for `@rules` may help here. CPU is our main concern at the moment, as it's the reason our host OS is shooting PIDs. The way this server is architected, it acts like a giant gateway, so to speak. It waits for a “task server” to be kicked off and then launches a bunch of mini servers that mock out the flow of an AWS step function in our local and CI environments. We know we could optimize here for improvements on our end, but at this time our goal is just to get to pants 2.
Do you have any suggestions on things we can do (presently) to help mitigate this issue?
w
> Do you have a timeline on this so we could test it out?
in the next month, most likely? if it’s your only blocking issue, we could look into bumping it up a bit.
s
we are 100% blocked on implementing pants 2 because of this issue. This should be our last hurdle
w
ok, thanks.
> CPU is our main concern at the moment, as it's the reason our host OS is shooting PIDs.
interesting. are you sure…? i didn’t realize that that was a thing!
if you’re pretty sure, i could get a patch out to expose setting thread counts today, which you could experiment with.
grabbing lunch, back in a bit.
e
Agreed with Stu - CPU overusage killing would be new to me; the PID shooting sounds like the OOMKiller.
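(A generic way to check for OOMKiller evidence, not something prescribed in the thread:)
```bash
# The OOM killer logs to the kernel ring buffer / journal:
dmesg -T | grep -iE 'out of memory|killed process'
journalctl -k | grep -i oom
```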
s
We were running `top` on our linux boxes, watching this application run while our tests ran. We observed the pants processes taking CPU to the ceiling, while memory was pretty contained. I could verify again to be 1000% sure.
@witty-crayon-22786 if you could patch the thread counts that would be 🔥 and we could play with them today and see if that helps.
w
s
i'll do this now
w
(i’ve been working up a patch to make this configurable, but it’s taken longer than i expected because it’s used so early during startup. will have it out today.)
s
ok, thank you. i've been trying some other internal things to see if i can make improvements. all to no avail so far…
w
which release of pants are you targeting? i can backport this to 2.0.x or 2.1.x if need be, but it might be worth having you run a dev version to validate it before we go ahead on getting it into a release
s
i'm running `2.0.1rc4`
i have to be there, as i haven't fixed things for deprecated options in 2.1 yet
w
got it. yea, reasonable.
s
i can run a dev version but it has to be 2.0.x
w
yep, i can arrange that.
👍 1
s
perfect, i appreciate the help here
w
do you have access to s3 from the place you’re testing?
@salmon-barista-63163: ^
s
yeah, it's in CircleCI, so i can pull from s3, yes
w
great. yea, that makes it easier.
👍 1
will kick off a build shortly, and then you can use the PANTS_SHA support in more recent versions of the `pants` script to try out the pre-release build: https://github.com/pantsbuild/setup/blob/69351495867bd555a76b4a523f816b66acb8506f/pants#L15-L18
ok, as soon as https://travis-ci.com/github/pantsbuild/pants/builds/209249097 goes green (might be about 60 minutes), you can use `PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0`. it’s branched from `2.0.1rc4`
logging off for the evening: good luck. can look further tomorrow.
(peeked back in to see how this was doing, and it needed an update. edited the above)
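(For reference, `PANTS_SHA` is read by the `./pants` bootstrap script linked above; usage would look roughly like this, with the target name hypothetical:)
```bash
# One-off: bootstrap and run a pre-release pants built from that commit.
PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0 ./pants --version

# Or export it for a whole testing session:
export PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0
./pants run tasks:bin              # hypothetical target
```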
s
👍 i'm going to try this now. Thanks for doing the work to get that out to me. I'll have an update shortly
what var can i set in the global config to gain access to the threading?
w
It looks like if you set it low enough you can deadlock... so would probably not go below 2.
👍 1
let me know if this works out for you. i think we’ll land it regardless, but i might not cherry-pick it unless you confirm that it addresses your issue.
> It looks like if you set it low enough you can deadlock... so would probably not go below 2.
i’m adding a defense against this to the final patch.
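(The final option names aren't stated in this thread; later pants releases expose the rule-engine pool via the global `rule_threads_core`/`rule_threads_max` options, so the experiment presumably looked something like this — treat the names and values as an assumption, not something confirmed here:)
```bash
# Assumed option names (from later pants releases; not confirmed in this
# thread). Keep the core thread count >= 2 to avoid the deadlock above.
PANTS_SHA=372942e0f30763fff8b11a8a344fb6c36a6b23a0 \
  ./pants --rule-threads-core=2 --rule-threads-max=4 run tasks:bin
```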
s
nice call on the defense. This looks to have semi-solved my issue… this, in combo with some modifications to our code, seems to have worked.
looks like the limit on an 8-CPU, 16 GB memory machine is about 10 pants processes at once. any more in parallel and it hits the ceiling
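(One generic way to enforce that ceiling from the task-server side — a sketch, not something prescribed in the thread; the target list is illustrative:)
```bash
# Cap concurrent pants invocations at the observed safe limit of 10:
printf '%s\n' tasks:foo tasks:bar tasks:baz |   # illustrative target list
  xargs -P 10 -I{} ./pants run {}
```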
e
Excellent. Did you get a chance to grep for OOMKiller evidence or lack thereof prior to the fix?
s
it was CPU bound. OOM killer was not killing the processes.
w
Cool beans. So I assume you need this backported to both 2.0.x and 2.1.x then, huh
s
yes please. We are going to finalize our 2.0.x migration today. Then we are going to move to 2.1.x in the coming weeks.
e
> it was CPU bound. OOM killer was not killing the processes.
Huh, ok - thanks. At some point I'm sure we'll need to learn what subsystem or configuration implements this CPUKiller - we're bound to see it again.
w
@salmon-barista-63163: ok: if you’re unblocked using that branch, i will likely wait until after we’ve cut the current final releases of 2.0.1 and 2.1.1 (tomorrow morning: they’ve been queued for a really long time now) to backport this
s
@witty-crayon-22786 that is fine. I am unblocked for now
@enough-analyst-54434 @witty-crayon-22786 I agree here. I am more than happy to get on a call with whoever and show this happening in real time, maybe do a little debugging. just let me know