# general
s
Hey everyone, once again I'm having performance issues with the v2 engine and am looking for suggestions. Here is my setup:
circleci
8 CPU
16 RAM
I am trying to run tests via pytest or behave. In this environment we start up our servers and run the tests. We have about 8 webservers running (with pants) and then we kick off the tests. I am getting out-of-memory errors and killed processes when running. If you have been following what I have been posting, I am having all sorts of performance issues. I am running pants 2.0.1rc4 (with a twist from a release that was cut for me yesterday). There is no way the performance of v2 can be this much worse than v1 as a whole. At this point I'm thinking I'm missing or overlooking something. Any suggestions?
h
I haven’t been following the past threads closely - apologies if you’ve already answered. Do you know where the costs are mostly coming from? Some candidates:
* Cache upload and download
* Resolving requirements
* Pants overhead when setting up tests
* The test processes themselves, that they’re slower than they were before
Running with --no-dynamic-ui may be instructive, as it will output the start time and end time for things
w
my understanding is that you are spawning N ./pants run “clients”, which are hitting M ./pants run “servers” on the same box
is that right?
s
yes, @witty-crayon-22786, but that part has been resolved. Now throwing one more pants process on top of that to run tests or execute pytest really makes things go crazy
little more context:
• the CircleCI pants cache is loaded with no issue
• the dynamic ui is off
• pantsd is off
Here are the things that are for sure slower with pants v2 than v1.
• pants startup processes
• pants building a pex file
• running pytest
• running our behave tests
w
so, during the portion of ./pants run where the process is spawned, pants itself should be using effectively no CPU at all, but it is definitely using memory. this was why john and i were curious about the OOM killer
“Here are the things that are for sure slower with pants v2 than v1.”
when pantsd is off, it is not surprising that the first item is slower: pantsd is intended to cut off the startup time.
and all other tasks will be lightly affected as well.
we’re working on optimizing pytest runs, because there is some extra overhead there that we know about.
but i think that that might all be moot. if your processes are getting killed, i am only aware of the OOM killer being able to do that in linux. i do not know of a facility to kill things based on CPU usage
s
I wasn't aware of anything that would kill processes for CPU either, but if I watch the available memory there is plenty there when it happens
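One way to watch available memory while the jobs run is to sample MemAvailable from /proc/meminfo. The sketch below is a minimal illustration, not something from this thread: the function names are made up for the example, and it assumes a Linux box. One caveat worth double-checking: inside a container, /proc/meminfo commonly reports the host's figures rather than the container's own limit, so "plenty available" there may not reflect the limit the job is actually held to.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: first numeric value}.

    Values are usually in kB; lines without a numeric value are skipped.
    """
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def read_available_kb(path: str = "/proc/meminfo") -> int:
    """Return MemAvailable (in kB) as reported by the given meminfo file."""
    with open(path) as handle:
        return parse_meminfo(handle.read())["MemAvailable"]
```

Calling read_available_kb() every few seconds from a background loop and logging the result would give a timeline to compare against the moment a process is killed.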
w
are the clients running pytest? or some other code?
s
it's pytest
w
and this can’t be phrased as a single client running a single run of pants running multiple tests concurrently?
s
I get a lot of this error in my pytest runs
"Error reading file File { path: \".coverage\", is_executable: false }: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }"
which typically happens when there is no memory left to finish writing the coverage report
I'm about to turn off the coverage features and see if this helps
If I bump the CircleCI executor size up one, things start to smooth out… but this isn't something I can do long term, as the larger executors are very expensive
w
at a fundamental level, changing your client to be a single run of ./pants test $multiple_targets with parallelism enabled is what would be ideal, as that would use dramatically less memory.
having tons of independent instances of pants is just not the intended usage right now. we’ll keep working on https://github.com/pantsbuild/pants/issues/7654 , but it’ll be at least a few weeks.
s
so for the tests it's one ./pants test execution
I'm struggling with performance impacts everywhere we run pants in our repo.
w
is that after the ./pants run clients have completed? or concurrently?
s
so when the tests run, they call into our webservers, which are running with pants. To me this looks like too many pants processes running at once, whereas with v1 we didn't run into this
it is after the ./pants run clients are up and running though
w
“up and running” or “done and exited”?
s
“up and running”
w
ok, so you have ~17 pants processes at once?
s
we could have upwards of 20+ yes
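To put those numbers side by side, here is a rough back-of-the-envelope sketch. It assumes "16 RAM" in the setup above means 16 GiB, and it ignores everything else on the box (the OS, the servers' own working memory, the test processes), so the real headroom per pants process would be smaller. The function name is made up for the example.

```python
def per_process_budget_gib(total_gib: float, process_count: int) -> float:
    """Evenly divide total memory across concurrently running processes."""
    if process_count < 1:
        raise ValueError("process_count must be at least 1")
    return total_gib / process_count


# ~17 and ~20 concurrent pants processes on an (assumed) 16 GiB executor.
for count in (17, 20):
    print(f"{count} processes: {per_process_budget_gib(16, count):.2f} GiB each")
```

With 20 processes that is 0.80 GiB each before anything else is accounted for, which helps explain why bumping the executor size smooths things out.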
w
v2 uses more memory… to a degree, that is intended and expected. it’s keeping work warm for further runs
…which you can only take advantage of with pantsd
h
I think you mentioned trying running ./pants package, followed by dist/app.pex - rather than ./pants run. Iirc, you stopped that approach because there was still too much contention when building the PEXes. Now that you have the building of PEXes part fixed, might it be worth trying that approach again? Then, you have ~20 Pex processes running and only 1 Pants process for Pytest
w
so, as discussed yesterday, i think you either need to 1) switch to ./pants package && dist/app.pex, or 2) wait for https://github.com/pantsbuild/pants/issues/7654, and we could try to prioritize it.
s
I am running ./pants package && dist/app.pex for the one webserver that would spin up tasks (pants processes) on demand. We pre-package ~30 pexes and then execute by just running the pex. It is the rest of our stack that is causing issues now. It's just too many things running in pants, I guess
h
It sounds like you’ve been playing with this already, but another thing you could continue tweaking is --process-execution-local-parallelism. Note that you can use it as a CLI option, not only a config file value. So, you can set it to a higher value when building PEXes, then lower it dramatically when running tests. In v1, it was effectively set to 1. Tests ran sequentially.
Hm, possibly worth trying to see the impact: ./pants test --debug :: emulates the v1 behavior. It runs each test sequentially in the foreground
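As a sketch of the per-invocation tuning described above, a small helper could build the command line with a different parallelism value for each goal. This is illustrative only: the helper name, the target specs, and the values 8 and 1 are assumptions made up for the example; only the --process-execution-local-parallelism option itself comes from the discussion.

```python
from typing import List, Sequence


def pants_command(goal: str, targets: Sequence[str], parallelism: int) -> List[str]:
    """Build a pants argv with local process parallelism set on the CLI."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        "./pants",
        f"--process-execution-local-parallelism={parallelism}",
        goal,
        *targets,
    ]


# Hypothetical targets: higher parallelism while packaging, minimal for tests.
package_cmd = pants_command("package", ["src/app:server"], parallelism=8)
test_cmd = pants_command("test", ["::"], parallelism=1)
print(" ".join(package_cmd))
print(" ".join(test_cmd))
```

The resulting lists can be passed to subprocess.run(...) from a CI script, which keeps the two settings next to each other rather than buried in a config file.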
w
Screen Shot 2020-12-18 at 10.07.57.png
basically, it looks like circle has a memory usage killer other than the kernel
s
hmmm
i have never seen that file
w
it’s an older post… newer posts about memory limits have more information
s
I have tried running tests sequentially like in v1 by setting --process-execution-local-parallelism to 1. Same issues
I'll check into some of the CircleCI memory posts
w
(but, basically: if you try to search for “CPU usage killer” for circleci, you only find information about memory usage)
s
yeah, I know the exit code 137 OOM one. I literally get a message from my linux container that says
Killed: <pid> Out of memory
that's what I am fighting in some test jobs here, and in others it all of a sudden exits with exit code 1
the places where it kills itself are always different. never consistent.
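For reference, exit code 137 follows the common shell convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. A minimal sketch for telling the two failure modes apart in CI logs (the function name is made up for the example):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit status, decoding 128+N as signal N."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit code {code} (no known signal {signum})"
        return f"killed by {name} (signal {signum})"
    return f"exited with error code {code}"


print(describe_exit_code(137))
print(describe_exit_code(1))
```

An exit code of 1, by contrast, means the process ended on its own with a generic failure, so the jobs that "suddenly exit with 1" may be a downstream symptom (for example a child process being killed) rather than the parent itself being killed.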
w
s
okay. thanks for chatting regarding it.. i was just trying to get any last ditch efforts to try to solve this.
w
the effect of that would be significantly reducing the amount of pants runs running at once
s
looks like I'll have to wait for this
do you have an ETA for when that ticket will land in a release? I would like to discuss this with my team.
w
the ./pants binary/package && dist/app.pex change should work in either v1 or v2… are you sure that that isn’t an option?
s
yes this works… but we cannot use that everywhere due to the nature of our system and architecture
I implemented it everywhere i could
w
the ~8 clients and ~8 servers are the critical spot i think
s
yeah that seems to be what i am seeing as well
w
BUT, unfortunately, we definitely won’t be able to backport that change to 2.0.x/2.1.x… it’s built atop a bunch of stuff that only exists in 2.2.x
s
okay. thank you
w
so, i think that we can commit to having this done in January.
s
so we are looking late jan release?
w
yea. done and shipped in January.
s
ok thank you @witty-crayon-22786
I will chat with my team, but it looks like we are going to halt our rollout of pants 2 until this happens
w
i’m a broken record, but i do think that the binary/package approach would work around the issue in this case. but we’ll be prioritizing the pantsd change because i 100% agree that it shouldn’t be necessary to avoid concurrent runs.
h
So we can get some context, can you elaborate on the reason that in some cases you have to run a server with ./pants run <tgt> and cannot switch to ./pants package <tgt> && ./dist/<tgt>.pex?
We'd like to understand the use-case better
Thanks!
s
@happy-kitchen-89482 So we have a server (let's call it the parent) that runs via a pants run process. That server calls subprocesses that each run a “task”. Let's call those child services. Those tasks were executed with a pants run command. We have since migrated those tasks to use ./pants package and then run the pex as a subprocess. The issue here is that we can spin up ~30 tasks, which gives us ~30 pants package processes running in parallel. I hope this answers the question, as it's kind of hard to explain what we use this server for without explaining a lot of our product infrastructure.
I do have a question: will the fix for pantsd be backported to pants 2.1, or will it only exist in 2.2? We are on 2.0 currently, and I'm just trying to see what I need to update to in preparation for the release with this bug fix
h
@witty-crayon-22786 to confirm, we would not backport pantsd support for concurrent runs because it is impossible to backport without breaking the deprecation policy, right?
“The issue here is that we can spin up ~30 tasks, which gives us ~30 pants package processes running in parallel.”
@salmon-barista-63163 to clarify, it sounds like it is not possible to invoke those all in a single run? Like ./pants package tgt1 tgt2 tgt3
s
correct @hundreds-father-404, we do this on the fly as tasks spin up with our application. We cannot do all the processes in a single run, unfortunately
w
no… it would just be a backport of something like 6 or 7 patches. it would be a big investment to ensure that it was stable.
h
I thought that we must remove all global state to land this change? Meaning Subsystem.global_instance()? We can’t backport that because of the deprecation policy