https://pantsbuild.org/ logo
#general
Title
# general
d

dazzling-diamond-4749

09/07/2021, 11:11 PM
Hi folks. Wondering if anyone has tried to parallelize tests in buildkite? I'm getting weird coverage report failures. I'm planning the tests by filtering for all the test targets and
split -l $block_size
. I suspect that I need to toposort and dedupe test targets?
w

witty-crayon-22786

09/07/2021, 11:17 PM
you shouldn’t need to unless you have test targets depending on other test targets (which is discouraged, but not banned, afaik)
assuming that we’re talking about Pants 2.x though, tests should be fully isolated and parallelized by the number of cores by default
have you already confirmed that the auto-detected https://www.pantsbuild.org/docs/reference-global#section-process-execution-local-parallelism value in this environment looks like it will saturate the box?
d

dazzling-diamond-4749

09/07/2021, 11:22 PM
yes we are using pants 2.x. I'm getting coverage report failures that I can't repro locally
I think we are using autodetect. I can't find an option on buildkite to enable a bigger build instance
so I resorted to parallelizing the tests
w

witty-crayon-22786

09/07/2021, 11:25 PM
what is the coverage failure?
d

dazzling-diamond-4749

09/07/2021, 11:26 PM
I get a log line like
Copy code
22:20:50.31 [WARN] Failed to generate coverage data for some/package/test/python/my/test.py:tests
But can not repro locally
w

witty-crayon-22786

09/07/2021, 11:30 PM
interesting… do all targets report that warning, or only some of them? and to confirm, you’re not running with
test --debug
, are you?
d

dazzling-diamond-4749

09/07/2021, 11:30 PM
just one of them. Deterministically. Let me add
--debug
flag and run again
w

witty-crayon-22786

09/07/2021, 11:30 PM
sorry: would not recommend adding
--debug
.
just wanted to confirm that you were not using it in this environment.
--debug
causes things to run sequentially in the foreground, and it disables some portions of coverage capture.
d

dazzling-diamond-4749

09/07/2021, 11:31 PM
oh no, I'm not adding --debug in CI
w

witty-crayon-22786

09/07/2021, 11:31 PM
do you have a separate
toml
config for CI by any chance?
d

dazzling-diamond-4749

09/07/2021, 11:31 PM
yes
w

witty-crayon-22786

09/07/2021, 11:32 PM
does it set
[test] debug
?
d

dazzling-diamond-4749

09/07/2021, 11:32 PM
no
Copy code
[test]
use_coverage = true

[coverage-py]
report = ["json", "html"]
global_report = true
👍 1
w

witty-crayon-22786

09/07/2021, 11:33 PM
ok. since it’s just one target, i do suspect something fishy related to test targets depending on other test targets
or perhaps a target that is actually a library, but is marked as a test
d

dazzling-diamond-4749

09/07/2021, 11:34 PM
is there a query I can run to find test depending on tests? Bazel style? 😂
w

witty-crayon-22786

09/07/2021, 11:35 PM
not exactly 😃
./pants dependees $target
should tell you if that target is being pulled in elsewhere
1
can also use something like
./pants filter --target-type=python_tests :: | xargs ./pants dependees
to check all test targets at once
d

dazzling-diamond-4749

09/07/2021, 11:40 PM
Copy code
$ ./pants filter --target-type=python_tests :: | xargs ./pants dependees             13s
16:39:26.19 [INFO] Initialization options changed: reinitializing scheduler...
16:39:26.89 [INFO] Scheduler initialized.
16:39:28.07 [INFO] Initialization options changed: reinitializing scheduler...
16:39:29.14 [INFO] Scheduler initialized.
No such targets found. One more interesting observation, the ordering of files in the test report is different in CI and locally, when I run the same command.
w

witty-crayon-22786

09/07/2021, 11:41 PM
mm, that’s not ideal.
but possibly related to the other issue you’re experiencing with this problematic target.
d

dazzling-diamond-4749

09/07/2021, 11:42 PM
Local is Mac, and CI is Ubuntu 18
Is there any way to get job logs? Seems like they are suppressed. Maybe there is way to see what coveragepy is complaining about?
w

witty-crayon-22786

09/07/2021, 11:46 PM
good idea. you can pass
test --output=all
(or set
[test] output = "all"
) to have the tests dump their output, which might include some more information about why the
.coverage
file wasn’t written
2
d

dazzling-diamond-4749

09/07/2021, 11:52 PM
While I wait for CI, just wanna get a sanity check on my test script. Maybe I wrote something silly
Copy code
./pants -l=error filter --filter-target-type=python_tests :: | sort > /tmp/all_tests

total_tests=$(cat /tmp/all_tests | wc -l)

total_blocks=$BUILDKITE_PARALLEL_JOB_COUNT
if test -z "$total_blocks"; then
    total_blocks=1
fi
current_block=$BUILDKITE_PARALLEL_JOB
if test -z "$current_block"; then
    current_block=0
fi

block_size=$(((total_tests/total_blocks)+1))

size as well
split -l $block_size -a 1 /tmp/all_tests test_target_set_

extra_space=$((block_size*total_blocks-total_tests))
empty_blocks=$((extra_space/block_size))

alphabet=({a..z})
block_index=${alphabet[current_block]}

# Skip empty blocks
if [ "$current_block" -ge "$((total_blocks-empty_blocks))" ]; then
    exit 0
fi

tests_to_run=$(cat test_target_set_$block_index | tr "\n" " ")

./pants test --output=all $tests_to_run
I found the cause. The test crashed and no coverage was ever generated. I'm still trying to figure out why running the test alone vs running groups tests would result in test passing vs crashing.
the test seems to crash without logs. No core dump no stacktrace 🤦‍♂️
👀 1
h

hundreds-father-404

09/08/2021, 1:02 AM
do you have the full Pants run for that?
fyi Rob shared a log over DM (for privacy). Pants reports that the test fails with exit code -9, which is sigkill and usually OOM killed @dazzling-diamond-4749 you said local is macOS, which has no OOM killer Do you expect the tests to be using lots of memory?
🙏 1
d

dazzling-diamond-4749

09/08/2021, 1:14 AM
Do you expect the tests to be using lots of memory?
Ohhhhhh, maybe. Its loading BERT, LOL
👆 1
h

hundreds-father-404

09/08/2021, 1:15 AM
d

dazzling-diamond-4749

09/08/2021, 1:16 AM
I see. I guess, is there a way to print OOM killed instead of just exiting -9? I'd be happy to contribute code for this
h

hundreds-father-404

09/08/2021, 1:19 AM
Hmmm maybe, yeah. You can't get Pytest itself to do that, it gets killed abruptly without being able to clean up The exit code line comes from https://github.com/pantsbuild/pants/blob/824b8d70dbdfb86cc89b1bebfaea9711fca4a715/src/python/pants/core/goals/test.py#L127. Maybe we should augment it to detect exit code's of 9 and -9? Or in general when running any Process, detect that exit code and always log a warning? That would catch things like resolving requirements, running a linter, etc, rather than just running the test I'm not sure, always tricky to know when to add special casing like this vs. when it's distracting, including the risk of leading people astray. What do you think?
d

dazzling-diamond-4749

09/08/2021, 1:23 AM
Let me read a bit on linux OOM killer. Maybe there is some stable interface we can depend on to print OOM killed?
w

witty-crayon-22786

09/08/2021, 1:53 AM
Negative exit codes are always signals. So we could convert that into something like "received signal SIG$X"
d

dazzling-diamond-4749

09/08/2021, 1:55 AM
so
exit -x
is -->
kill -x
?
w

witty-crayon-22786

09/08/2021, 2:03 AM
Yes, although it looks like this is a bit process-API specific. I believe it's always true in our usage of the Rust process APIs.
d

dazzling-diamond-4749

09/08/2021, 2:03 AM
I see, I was just reading about
128 + SIG
as the standard convention
d

dazzling-diamond-4749

09/08/2021, 2:08 AM
nice! I'll open an issue and send a PR. + a test that make sure
SIGINT
is correctly caught and surfaced as
Neg(sig)
<-- does this suffice?
w

witty-crayon-22786

09/08/2021, 2:17 AM
Yea, maybe... I just realized that it might not be iron clad, because when we remote execute a process, we won't be able to get the signal. So maybe the better thing to do would be to change the rust code near where I linked above to append some information about the signal to stderr...
d

dazzling-diamond-4749

09/08/2021, 2:18 AM
I see. so ?
Copy code
exit_status.signal().map(tee(print)).map(Neg::neg))
w

witty-crayon-22786

09/08/2021, 2:18 AM
But do feel free to file the issue: if you're interested in fixing it I can add more info about how. If not, I can take a look.
2
🙏 1
d

dazzling-diamond-4749

09/08/2021, 2:18 AM
I don't rust at all. Just guess there has to be some thing like
tee
h

happy-kitchen-89482

09/08/2021, 9:44 AM
To clarify my understanding of this entire thread, when you say "parallelize tests in buildkite" do you mean sharding across multiple buildkite machines? Or running tests in parallel on a single machine? Pants will happily take care of the latter
4 Views