# development
h
We've been getting timeouts in CI for tests that launch lots of processes and/or do 3rdparty resolves, like `pex_tests.py`. I suspect this is from contention - 60 seconds is more than enough for `import_parser_test.py`, for example, which takes 16.4s on my machine 🧵
GitHub has 2 cores, so `--process-execution-local-parallelism` defaults to 2, meaning we have 2 Pytest processes at the same time. Within each Pytest process, tests run sequentially. So I don't expect a wild number of processes at once? Generally, each individual test only spawns 1-2 processes.
Perhaps we should lower `--process-execution-local-parallelism` to 1? Meaning we lose all parallelism in CI 👀 Bumping the timeout to >60 seconds for a test that takes 16s locally smells wrong.
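For concreteness, a sketch of what that would look like in CI, assuming the usual `./pants` entry point (the same setting could equally live under `[GLOBAL]` in `pants.toml`):
```
$ ./pants --process-execution-local-parallelism=1 test ::
```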
e
So, how confident are you that integration tests respect cpu count limits globally? IOW, we have pants at the outer layer using 2 cores, but what happens when we launch 2 ITs on those two slots? Do they each try to use 2 cores of their own?
For example, on my machine with 8 cores:
```
$ python3.8 -c 'import multiprocessing, os; print(f"multiprocessing: {multiprocessing.cpu_count()} os: {os.cpu_count()} sched: {len(os.sched_getaffinity(0))}")'
multiprocessing: 8 os: 8 sched: 8
```
Versus:
```
$ docker run --cpuset-cpus 0 --rm python:3.8 python -c 'import multiprocessing, os; print(f"multiprocessing: {multiprocessing.cpu_count()} os: {os.cpu_count()} sched: {len(os.sched_getaffinity(0))}")'
multiprocessing: 8 os: 8 sched: 1
```
But we still may want to set this manually anyhow. With Docker, at any rate, using `--cpus 1` only sets a utilization limit of 1, and Python still sees 8 via all three methods.
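A sketch of the most defensive detection under those constraints, assuming Linux semantics for the affinity mask (a `--cpus`-style quota is invisible to all three methods, per the above):
```
import multiprocessing
import os

def available_cpus() -> int:
    """Best-effort count of CPUs this process may actually run on."""
    try:
        # Reflects cpuset restrictions (e.g. docker --cpuset-cpus), which
        # multiprocessing.cpu_count() and os.cpu_count() ignore.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Not available on macOS/Windows; fall back to the raw count.
        return multiprocessing.cpu_count()
```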
h
> IOW, we have pants at the outer layer using 2 cores, but what happens when we launch 2 ITs on those two slots?
Ah, good point
The latter is correct when containers are involved.
Will fix `[python-setup]` now.
e
Confirmed: we do not control parallelism of either full ITs or RuleRunner ~ITs, so our tests do load the machine more than the average Pants-using repo's tests would.
➕ 1
h
Thanks for checking! Do you already have a fix? I can get started if not
e
Not sure if we should or can default those two bits of test infra to 1 thread for local execution and 1 thread for resolves. Clearly 1 for resolves works. I'm not clear on 1 for local execution. Did that cause deadlocks in the past? I thought the Rust side had a lower bound in practice that was maybe not enforced.
> Clearly 1 for resolves works
Because `pex -j1` works - no trickery in that impl.
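e.g., a hypothetical resolve pinned to a single job (`requests` is just a placeholder requirement):
```
$ pex -j1 requests -o requests.pex
```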
h
You might be thinking of `--rule-threads-core`? We enforce that it's >= 2 to avoid deadlocks with interactive processes and goal rules.
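Roughly this invariant, sketched with a hypothetical helper (the real enforcement lives elsewhere in Pants):
```
def validate_rule_threads_core(n: int) -> int:
    # Fewer than 2 core threads can deadlock: an interactive process can
    # hold one thread while the goal rule awaiting it needs another.
    if n < 2:
        raise ValueError(f"--rule-threads-core must be >= 2, got {n}")
    return n
```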
e
Ah, yeah - that's what I was thinking of.
Ok, I'll whip something up.
🙌 1
h
Oh, yikes. In RuleRunner:
```
_EXECUTOR = PyExecutor(
    core_threads=multiprocessing.cpu_count(), max_threads=multiprocessing.cpu_count() * 4
)
```
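A minimal sketch of the kind of fix under discussion, assuming the same `PyExecutor` constructor and import as the snippet above (the committed change may pick different numbers):
```
# Sketch only: pin the test executor to a small fixed size instead of
# scaling with the host's core count, keeping core_threads >= 2 to
# avoid the interactive-process deadlocks noted above.
_EXECUTOR = PyExecutor(core_threads=2, max_threads=4)
```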
e
Well, that was presumably done with deliberation.
Hrm, no - that PR was mainly all about prod, and the setting there was probably not deliberate at all: https://github.com/pantsbuild/pants/pull/11325/files
➕ 1
Lazy reviewers on that one!
h
@enough-analyst-54434 are you still working on this? CI continues to flake regularly due to timeouts
I haven't seen any timeouts recently. I'm optimistic this helped! Thanks, John