# general
d
Hi folks. Wondering if anyone has tried to parallelize tests in buildkite? I'm getting weird coverage report failures. I'm planning the tests by filtering for all the test targets and `split -l $block_size`. I suspect that I need to toposort and dedupe test targets?
w
you shouldn’t need to unless you have test targets depending on other test targets (which is discouraged, but not banned, afaik)
assuming that we’re talking about Pants 2.x though, tests should be fully isolated and parallelized by the number of cores by default
have you already confirmed that the auto-detected https://www.pantsbuild.org/docs/reference-global#section-process-execution-local-parallelism value in this environment looks like it will saturate the box?
d
yes we are using pants 2.x. I'm getting coverage report failures that I can't repro locally
I think we are using autodetect. I can't find an option on buildkite to enable a bigger build instance
so I resorted to parallelizing the tests
w
what is the coverage failure?
d
I get a log line like
```
22:20:50.31 [WARN] Failed to generate coverage data for some/package/test/python/my/test.py:tests
```
but I can't repro it locally
w
interesting… do all targets report that warning, or only some of them? and to confirm, you’re not running with `test --debug`, are you?
d
just one of them. Deterministically. Let me add the `--debug` flag and run again
w
sorry: would not recommend adding `--debug`. just wanted to confirm that you were not using it in this environment. `--debug` causes things to run sequentially in the foreground, and it disables some portions of coverage capture.
d
oh no, I'm not adding --debug in CI
w
do you have a separate `toml` config for CI by any chance?
d
yes
w
does it set `[test] debug`?
d
no
```toml
[test]
use_coverage = true

[coverage-py]
report = ["json", "html"]
global_report = true
```
w
ok. since it’s just one target, i do suspect something fishy related to test targets depending on other test targets
or perhaps a target that is actually a library, but is marked as a test
d
is there a query I can run to find tests depending on tests? Bazel style? 😂
w
not exactly 😃
`./pants dependees $target` should tell you if that target is being pulled in elsewhere
can also use something like `./pants filter --target-type=python_tests :: | xargs ./pants dependees` to check all test targets at once
d
```
$ ./pants filter --target-type=python_tests :: | xargs ./pants dependees
16:39:26.19 [INFO] Initialization options changed: reinitializing scheduler...
16:39:26.89 [INFO] Scheduler initialized.
16:39:28.07 [INFO] Initialization options changed: reinitializing scheduler...
16:39:29.14 [INFO] Scheduler initialized.
```
No such targets found. One more interesting observation: the ordering of files in the test report is different in CI and locally, when I run the same command.
w
mm, that’s not ideal.
but possibly related to the other issue you’re experiencing with this problematic target.
d
Local is Mac, and CI is Ubuntu 18
Is there any way to get job logs? Seems like they are suppressed. Maybe there is a way to see what coverage.py is complaining about?
w
good idea. you can pass `test --output=all` (or set `[test] output = "all"`) to have the tests dump their output, which might include some more information about why the `.coverage` file wasn’t written
d
While I wait for CI, just wanna get a sanity check on my test script. Maybe I wrote something silly
```bash
./pants -l=error filter --filter-target-type=python_tests :: | sort > /tmp/all_tests

total_tests=$(cat /tmp/all_tests | wc -l)

total_blocks=$BUILDKITE_PARALLEL_JOB_COUNT
if test -z "$total_blocks"; then
    total_blocks=1
fi
current_block=$BUILDKITE_PARALLEL_JOB
if test -z "$current_block"; then
    current_block=0
fi

block_size=$(((total_tests/total_blocks)+1))

# Split the sorted target list into fixed-size blocks, one per Buildkite shard.
split -l $block_size -a 1 /tmp/all_tests test_target_set_

extra_space=$((block_size*total_blocks-total_tests))
empty_blocks=$((extra_space/block_size))

alphabet=({a..z})
block_index=${alphabet[current_block]}

# Skip empty blocks
if [ "$current_block" -ge "$((total_blocks-empty_blocks))" ]; then
    exit 0
fi

tests_to_run=$(cat test_target_set_$block_index | tr "\n" " ")

./pants test --output=all $tests_to_run
```
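fwiw, an equivalent way to pick this shard's slice, in case it's clearer (same Buildkite env vars as above; the rest is just an illustrative sketch, not what CI actually runs):
```python
import os
import subprocess

# Which shard am I? Buildkite sets these for parallel jobs.
job = int(os.environ.get("BUILDKITE_PARALLEL_JOB", 0))
job_count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", 1))

# Same target listing as the script above, sorted so every shard agrees on the order.
all_targets = sorted(
    subprocess.run(
        ["./pants", "-l=error", "filter", "--filter-target-type=python_tests", "::"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
)

# Stride through the sorted list: shard k takes targets k, k+n, k+2n, ...
shard = all_targets[job::job_count]
if shard:
    subprocess.run(["./pants", "test", "--output=all", *shard], check=True)
```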
I found the cause. The test crashed and no coverage was ever generated. I'm still trying to figure out why running the test alone vs. running groups of tests results in the test passing vs. crashing.
the test seems to crash without logs. No core dump, no stacktrace 🤦‍♂️
h
do you have the full Pants run for that?
fyi Rob shared a log over DM (for privacy). Pants reports that the test fails with exit code -9, which is SIGKILL and usually means the process was OOM-killed.
@dazzling-diamond-4749 you said local is macOS, which has no OOM killer. Do you expect the tests to be using lots of memory?
d
> Do you expect the tests to be using lots of memory?
Ohhhhhh, maybe. It's loading BERT, LOL
d
I see. I guess: is there a way to print "OOM killed" instead of just exiting with -9? I'd be happy to contribute code for this
h
Hmmm maybe, yeah. You can't get Pytest itself to do that, it gets killed abruptly without being able to clean up.
The exit code line comes from https://github.com/pantsbuild/pants/blob/824b8d70dbdfb86cc89b1bebfaea9711fca4a715/src/python/pants/core/goals/test.py#L127. Maybe we should augment it to detect exit codes of 9 and -9? Or in general, when running any Process, detect that exit code and always log a warning? That would catch things like resolving requirements, running a linter, etc., rather than just running the test.
I'm not sure, it's always tricky to know when to add special casing like this vs. when it's distracting, including the risk of leading people astray. What do you think?
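Roughly the sort of special-casing I have in mind, as a sketch only (the helper name and logger are made up, this is not actual Pants code):
```python
import signal

def maybe_warn_killed(exit_code: int, description: str, logger) -> None:
    """Hypothetical helper: warn when a process looks like it was OOM-killed.

    A child killed by signal N is surfaced as exit code -N, so SIGKILL shows
    up as -9; some runners report the raw 9 or the shell-style 137 instead.
    """
    sigkill = int(signal.SIGKILL)
    if exit_code in (-sigkill, sigkill, 128 + sigkill):
        logger.warning(
            f"{description} exited with {exit_code} (SIGKILL). On Linux this is "
            "often the OOM killer terminating the process; check memory usage."
        )
```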
d
Let me read a bit on the Linux OOM killer. Maybe there is some stable interface we can depend on to print "OOM killed"?
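One candidate seems to be cgroup v2's `memory.events` file, which exposes an `oom_kill` counter. A rough sketch of reading it (the paths assume a cgroup v2 box; this is just an idea, not something we run today):
```python
from pathlib import Path

def oom_kill_count() -> int:
    """Rough sketch: read the cgroup v2 oom_kill counter for the current cgroup.

    Assumes a cgroup v2 (unified) hierarchy; returns 0 where the file is absent
    (cgroup v1, macOS, etc.).
    """
    try:
        # On cgroup v2, /proc/self/cgroup is a single line like "0::/some/path".
        rel = Path("/proc/self/cgroup").read_text().strip().split("::", 1)[1]
        events = Path("/sys/fs/cgroup") / rel.lstrip("/") / "memory.events"
        for line in events.read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "oom_kill":
                return int(value)
    except (OSError, IndexError):
        pass
    return 0

# Compare the counter before and after the test run; an increase means
# something in this cgroup was OOM-killed.
print(oom_kill_count())
```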
w
Negative exit codes are always signals. So we could convert that into something like "received signal SIG$X"
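in Python terms the mapping would be roughly (just a sketch using the stdlib `signal` module):
```python
import signal

def describe_exit(code: int) -> str:
    # Negative codes mean "killed by that signal": -9 -> "received signal SIGKILL".
    if code < 0:
        return f"received signal {signal.Signals(-code).name}"
    return f"exited with code {code}"

print(describe_exit(-9))  # received signal SIGKILL
print(describe_exit(1))   # exited with code 1
```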
d
so `exit -x` is --> `kill -x`?
w
Yes, although it looks like this is a bit process-API specific. I believe it's always true in our usage of the Rust process APIs.
d
I see, I was just reading about `128 + SIG` as the standard convention
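both conventions are easy to see with a quick subprocess demo (assuming bash is available on the box):
```python
import subprocess

# Python's subprocess reports "killed by signal N" as a negative return code.
proc = subprocess.run(["bash", "-c", "kill -9 $$"])
print(proc.returncode)  # -9

# A shell observing the same death reports 128 + N instead, so SIGKILL becomes 137.
out = subprocess.run(
    ["bash", "-c", "bash -c 'kill -9 $$'; echo $?"],
    capture_output=True, text=True,
)
print(out.stdout.strip())  # 137
```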
d
nice! I'll open an issue and send a PR, plus a test that makes sure `SIGINT` is correctly caught and surfaced as `Neg(sig)` <-- does this suffice?
w
Yea, maybe... I just realized that it might not be ironclad, because when we remote execute a process, we won't be able to get the signal. So maybe the better thing to do would be to change the Rust code near where I linked above to append some information about the signal to stderr...
d
I see. so, something like this?
```rust
exit_status.signal().map(tee(print)).map(Neg::neg)
```
w
But do feel free to file the issue: if you're interested in fixing it I can add more info about how. If not, I can take a look.
d
I don't rust at all. Just guessing there has to be something like `tee`
h
To clarify my understanding of this entire thread, when you say "parallelize tests in buildkite" do you mean sharding across multiple buildkite machines? Or running tests in parallel on a single machine? Pants will happily take care of the latter