I might have to murder some caches
# general
I might have to murder some caches
I'm hitting this continually, and from checking random master reds (~all CI builds of master have been red for ages :/), this problem goes back at least a few weeks. The state of master in CI is beyond unhealthy.
I too had to nuke the corresponding shard cache (Integration tests - shard 19 (Py3.6 PEX)) on master to get past this. Nuking my PR cache was not enough. This implies that between Benjy's nukes + green and mine, a bad cache got written on master, which indicates a systemic bug afaict. I'll file as many details as I can.
Yeah, this is borderline disastrous.
that cache issue is fairly different from the rest of the failures
currently, there are likely a fair number of failures due to timeouts post https://groups.google.com/d/topic/pants-devel/pN2fzqCIk-k/discussion
Agreed, but it's persistent, and even discovering it's different is problematic since we are so red, which goes back prior to that email afaict.
well, the point of that email was to address a particular type of failure: the timeout
Agreed - no issues with the email - just the bleeding on master for a time period that spans well before that to now.
Master is bankrupt.
there are other flaky tests, definitely. but i just want to point out that one type of potential failure changed modes (timeouts), and likely still needs everyone's help to tune
I think perhaps we should all agree never to merge on red. You must re-try shards until green to feel the pain each and every time. Perhaps that gets folks fixing flakes?
Even I don't believe my last sentence, but an otherwise healthy project looks really unhealthy.
in general, we are not merging on red.
OK. Well I'm not sure of a solution yet, but master shouldn't be this red by any sane measure.
looking at https://travis-ci.org/pantsbuild/pants/builds/568680753 , it's all timeouts, which means the email above is very highly relevant
(and some network flakes)
and the action item for timeouts right now is very clear
OK. Afaict no one has done anything with that email. We have been in a similar situation with flakes afaict. Folks label them, but rarely does anyone dive into the issue and fix it.
There is no incentive that I can see to get master green. I think that's the root issue.
Clicking retry is too easy.
i wish i had an answer. i don't. for now, i'm going to do the thing that i optimistically asked other folks to take some time to do
(increase the timeouts for tests that are failing on master)
ftr: having audited the non-cron (separate but also important topic) failures for the last few runs, only 1 of the failures was due to a flaky test, and the rest were timeouts or network failures causing shards to fail to start
i will open a PR to increase those timeouts, and have bumped the one flaky test ticket.
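fwiw, the kind of audit described above (bucketing failed shards into timeouts, network failures, and everything else) could be scripted roughly like this. A minimal sketch only: the marker strings and the `classify_failure` / `tally_failures` helpers are hypothetical, not actual Pants or Travis output or tooling, and anything landing in "other" would still need a human look to decide whether it's a flaky test.

```python
from collections import Counter

# Hypothetical substrings used to bucket a failed shard's log. These are
# assumptions for illustration, not real Pants or Travis log messages.
CAUSE_MARKERS = {
    "timeout": ("timed out", "timeout"),
    "network": ("connection reset", "could not resolve host"),
}


def classify_failure(log_text):
    """Return 'timeout', 'network', or 'other' for one failed shard log."""
    lowered = log_text.lower()
    for cause, markers in CAUSE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return cause
    # Unmatched failures need manual triage (possibly a flaky test).
    return "other"


def tally_failures(logs):
    """Count failure causes across an iterable of failed shard logs."""
    return Counter(classify_failure(log) for log in logs)
```

Feeding it the logs of the last few non-cron master failures would give a per-cause count to put in front of committers.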
I'll have some data here soon and will craft an email to committers if it is as my sampling suggested. I appreciate the work you've done here with timeouts to clarify real failure reasons. I truly do think the problem is well beyond this spate and, like you, I have no good ideas except painting the picture more starkly with embarrassing data.
based on which flaky-test tickets have been bumped recently, i think that a lot of them are orphaned / no longer relevant.
thanks John... will look later today