# general
h
I might have to murder some caches
e
I'm hitting this continually, and from checking random master reds (~all CI builds of master have been red for ages :/), this problem goes back at least a few weeks. The state of master in CI is beyond unhealthy.
I too had to nuke the corresponding shard cache (Integration tests - shard 19 (Py3.6 PEX)) on master to get past this. Nuking my PR cache was not enough. This implies that between Benjy's nukes + green and mine, a bad cache got written on master, which indicates a systemic bug afaict. I'll file as many details as I can.
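(For context, a rough sketch of what "nuking" a branch's cache can look like against the Travis CI v3 API. This is not a record of the steps actually taken above; the token env var, the URL-encoded repo slug, and the `branch` filter on DELETE are my assumptions.)

```python
import os
import requests  # assumes the requests library is installed

# Hypothetical sketch only: delete cached shard state for the master branch via
# Travis CI's v3 API. If the `branch` filter is not honored on DELETE, this
# would drop all caches for the repo, so double-check before running.
token = os.environ["TRAVIS_TOKEN"]
resp = requests.delete(
    "https://api.travis-ci.org/repo/pantsbuild%2Fpants/caches",
    headers={
        "Travis-API-Version": "3",
        "Authorization": f"token {token}",
    },
    params={"branch": "master"},  # assumption: branch filtering is supported here
)
resp.raise_for_status()
print("cache delete returned:", resp.status_code)
```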
h
Yeah, this is borderline disastrous.
w
that cache issue is fairly different from the rest of the failures
currently, there are likely a fair number of failures due to timeouts post https://groups.google.com/d/topic/pants-devel/pN2fzqCIk-k/discussion
e
Agreed, but it's persistent, and even discovering it's different is problematic since we are so red. Which goes back to well before that email afaict.
w
well, the point of that email was to address a particular type of failure: the timeout
e
Agreed - no issues with the email - just the bleeding on master over a time period that spans from well before that to now.
Master is bankrupt.
w
there are other flaky tests, definitely. but i just want to point out that one type of potential failure changed modes (timeouts), and likely still needs everyone's help to tune
e
I think perhaps we should all agree never to merge on red. You must re-try shards until green to feel the pain each and every time. Perhaps that gets folks fixing flakes?
Even I don't believe my last sentence, but an otherwise healthy project looks really unhealthy.
w
in general, we are not merging on red.
e
OK. Well I'm not sure of a solution yet, but master shouldn't be this red by any sane measure.
w
looking at https://travis-ci.org/pantsbuild/pants/builds/568680753 , it's all timeouts, which means the email above is very highly relevant
(and some network flakes)
and the action item for timeouts right now is very clear
e
OK. Afaict no one has done anything with that email. We have been in a similar situation with flakes before, afaict: folks label them, but rarely does anyone dive into the issue and fix it.
There is no incentive that I can see to get master green. I think that's the root issue.
Clicking retry is too easy.
w
i wish i had an answer. i don't. for now, i'm going to do the thing that i optimistically asked other folks to take some time to do
(increase the timeouts for tests that are failing on master)
ftr: having audited the non-cron (separate but also important topic) failures for the last few runs, only 1 of the failures was due to a flaky test, and the rest were timeouts or network failures causing shards to fail to start
i will open a PR to increase those timeouts, and have bumped the one flaky test ticket.
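(For reference, the per-test timeout knob being discussed lives on the test targets themselves. A minimal sketch of what such a bump might look like in a BUILD file; the target name, sources glob, and the 600-second value are illustrative, not taken from the actual PR.)

```python
# Illustrative BUILD snippet only; names and the timeout value are made up.
python_tests(
  name='integration',
  sources=globs('*_integration.py'),
  # Per-target timeout in seconds: raise it for tests that legitimately run
  # long, rather than letting the whole shard die on the global limit.
  timeout=600,
)
```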
e
I'll have some data here soon and will craft an email to committers if it is as my sampling suggested. I appreciate the work you've done here with timeouts to clarify real failure reasons, but I truly do think the problem is well beyond this spate, and, like you, I have no good ideas except painting the picture more starkly with embarrassing data.
w
thanks.
based on which flaky-test tickets have been bumped recently, i think that a lot of them are orphaned/no-longer relevant.
e
w
thanks John... will look later today