
happy-kitchen-89482

08/05/2019, 3:14 PM
I might have to murder some caches

enough-analyst-54434

08/07/2019, 4:26 PM
I'm hitting this continually, and from checking random master reds (~all CI builds of master have been red for ages :/) this problem goes back at least a few weeks. The state of master in CI is beyond unhealthy.
I too had to nuke the corresponding shard cache (Integration tests - shard 19 (Py3.6 PEX)) on master to get past this. Nuking my PR cache was not enough. This implies that between Benjy's nukes + green and mine, a bad cache got written on master, which indicates a systemic bug afaict. I'll file as many details as I can.

happy-kitchen-89482

08/07/2019, 5:31 PM
Yeah, this is borderline disastrous.

witty-crayon-22786

08/07/2019, 5:58 PM
that cache issue is fairly different from the rest of the failures
currently, there are likely a fair number of failures due to timeouts post https://groups.google.com/d/topic/pants-devel/pN2fzqCIk-k/discussion

enough-analyst-54434

08/07/2019, 5:59 PM
Agreed, but it's persistent, and even discovering it's different is problematic since we are so red, which goes back prior to that email afaict.

witty-crayon-22786

08/07/2019, 6:00 PM
well, the point of that email was to address a particular type of failure: the timeout

enough-analyst-54434

08/07/2019, 6:00 PM
Agreed - no issues with the email - just the bleeding on master for a time period that spans well before that to now.
Master is bankrupt.

witty-crayon-22786

08/07/2019, 6:00 PM
there are other flaky tests, definitely. but i just want to point out that one type of potential failure changed modes (timeouts), and likely still needs everyone's help to tune

enough-analyst-54434

08/07/2019, 6:01 PM
I think perhaps we should all agree never to merge on red. You must re-try shards until green to feel the pain each and every time. Perhaps that gets folks fixing flakes?
Even I don't believe my last sentence, but an otherwise healthy project looks really unhealthy.

witty-crayon-22786

08/07/2019, 6:05 PM
in general, we are not merging on red.

enough-analyst-54434

08/07/2019, 6:06 PM
OK. Well I'm not sure of a solution yet, but master shouldn't be this red by any sane measure.

witty-crayon-22786

08/07/2019, 6:06 PM
looking at https://travis-ci.org/pantsbuild/pants/builds/568680753 , it's all timeouts, which means the email above is very highly relevant
(and some network flakes)
and the action item for timeouts right now is very clear

enough-analyst-54434

08/07/2019, 6:07 PM
OK. Afaict no one has done anything with that email. We have been in a similar situation with flakes afaict. Folks label them, but rarely does anyone dive into the issue and fix it.
There is no incentive that I can see to get master green. I think that's the root issue.
Clicking retry is too easy.

witty-crayon-22786

08/07/2019, 6:12 PM
i wish i had an answer. i don't. for now, i'm going to do the thing that i optimistically asked other folks to take some time to do
(increase the timeouts for tests that are failing on master)
ftr: having audited the non-cron (separate but also important topic) failures for the last few runs, only 1 of the failures was due to a flaky test, and the rest were timeouts or network failures causing shards to fail to start
i will open a PR to increase those timeouts, and have bumped the one flaky test ticket.

enough-analyst-54434

08/07/2019, 6:23 PM
I'll have some data here soon and will craft an email to committers if it is as my sampling suggested. I appreciate the work you've done here with timeouts to clarify real failure reasons, but I truly do think the problem is well beyond this spate, and like you, I have no good ideas except painting the picture more starkly with embarrassing data.

witty-crayon-22786

08/07/2019, 6:23 PM
thanks.
based on which flaky-test tickets have been bumped recently, i think that a lot of them are orphaned/no longer relevant.

enough-analyst-54434

08/07/2019, 7:56 PM

witty-crayon-22786

08/07/2019, 8:03 PM
thanks John... will look later today