# development
h
We've been getting timeouts in CI for tests that launch lots of processes and/or do 3rdparty resolves, like `pex_tests.py`. I suspect this is from contention - 60 seconds is more than enough for `import_parser_test.py`, for example, which takes 16.4s on my machine 🧵
GitHub has 2 cores, so `--process-execution-local-parallelism` defaults to 2, meaning we have 2 Pytest processes at the same time. Within each Pytest process, tests run sequentially. So I don't expect a wild number of processes at once? Generally, each individual test only spawns 1-2 processes.
Perhaps we should lower `--process-execution-local-parallelism` to 1? Meaning we lose all parallelism in CI 👀 Bumping the timeout to >60 seconds for a test that takes 16s locally smells wrong.
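For concreteness, a sketch of what that would look like in CI, assuming the usual `./pants` entry point (the same setting could equally live under `[GLOBAL]` in `pants.toml`):
```
$ ./pants --process-execution-local-parallelism=1 test ::
```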
e
So, how confident are you that integration tests respect cpu count limits globally? IOW, we have pants at the outer layer using 2 cores, but what happens when we launch 2 ITs on those two slots? Do they each try to use 2 cores of their own?
For example, on my machine with 8 cores:
```
$ python3.8 -c 'import multiprocessing, os; print(f"multiprocessing: {multiprocessing.cpu_count()} os: {os.cpu_count()} sched: {len(os.sched_getaffinity(0))}")'
multiprocessing: 8 os: 8 sched: 8
```
Versus:
```
$ docker run --cpuset-cpus 0 --rm python:3.8 python -c 'import multiprocessing, os; print(f"multiprocessing: {multiprocessing.cpu_count()} os: {os.cpu_count()} sched: {len(os.sched_getaffinity(0))}")'
multiprocessing: 8 os: 8 sched: 1
```
But we still may want to set this manually anyhow. With Docker, at any rate, using `--cpus 1` only sets a utilization limit of 1, and Python still sees 8 via all three methods.
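A sketch of the most defensive detection under those constraints, assuming Linux semantics for the affinity mask (a `--cpus`-style quota is invisible to all three methods, per the above):
```
import multiprocessing
import os

def available_cpus() -> int:
    """Best-effort count of CPUs this process may actually run on."""
    try:
        # Reflects cpuset restrictions (e.g. docker --cpuset-cpus), which
        # multiprocessing.cpu_count() and os.cpu_count() ignore.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Not available on macOS/Windows; fall back to the raw count.
        return multiprocessing.cpu_count()
```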
h
> IOW, we have pants at the outer layer using 2 cores, but what happens when we launch 2 ITs on those two slots?
Ah, good point
The latter is correct when containers are involved.
Will fix `[python-setup]` now.
e
Confirmed: we do not control parallelism of either full ITs or RuleRunner ~ITs, so our tests do load the machine more than the average Pants-using repo's tests would.
➕ 1
h
Thanks for checking! Do you already have a fix? I can get started if not
e
Not sure if we should or can default those two bits of test infra to 1 thread for local execution and 1 thread for resolves. Clearly 1 for resolves works. I'm not clear on 1 for local execution. Did that cause deadlocks in the past? I thought the Rust side had a lower bound in practice that was maybe not enforced.
> Clearly 1 for resolves works
Because `pex -j1` works - no trickery in that impl.
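e.g., a hypothetical resolve pinned to a single job (`requests` is just a placeholder requirement):
```
$ pex -j1 requests -o requests.pex
```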
h
You might be thinking of `--rule-threads-core`? We enforce that it's >= 2 to avoid deadlocks with interactive processes and goal rules.
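Roughly this invariant, sketched with a hypothetical helper (the real enforcement lives elsewhere in Pants):
```
def validate_rule_threads_core(n: int) -> int:
    # Fewer than 2 core threads can deadlock: an interactive process can
    # hold one thread while the goal rule awaiting it needs another.
    if n < 2:
        raise ValueError(f"--rule-threads-core must be >= 2, got {n}")
    return n
```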
e
Ah, yeah - that's what I was thinking of.
Ok, I'll whip something up.
🙌 1
h
Oh, yikes. In RuleRunner:
```
_EXECUTOR = PyExecutor(
    core_threads=multiprocessing.cpu_count(), max_threads=multiprocessing.cpu_count() * 4
)
```
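A minimal sketch of the kind of fix under discussion, assuming the same `PyExecutor` constructor and import as the snippet above (the committed change may pick different numbers):
```
# Sketch only: pin the test executor to a small fixed size instead of
# scaling with the host's core count, keeping core_threads >= 2 to
# avoid the interactive-process deadlocks noted above.
_EXECUTOR = PyExecutor(core_threads=2, max_threads=4)
```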
e
Well, that was presumably done with deliberation.
Hrm, no - that PR was mainly all about prod, and the setting there was probably not deliberate at all: https://github.com/pantsbuild/pants/pull/11325/files
➕ 1
Lazy reviewers on that one!
h
@enough-analyst-54434 are you still working on this? CI continues to flake regularly due to timeouts
I haven't seen any timeouts recently. I'm optimistic this helped! Thanks, John