# general
h
I might have to murder some caches
e
I'm hitting this continually, and from checking random master reds (~all CI builds of master have been red for ages :/), this problem goes back at least a few weeks. The state of master in CI is beyond unhealthy.
I too had to nuke the corresponding shard cache (Integration tests - shard 19 (Py3.6 PEX)) on master to get past this. Nuking my PR cache was not enough. This implies that between Benjy's nukes + green and mine, a bad cache got written on master, which indicates a systemic bug afaict. I'll file as many details as I can.
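(For context, a rough sketch of what "nuking" a branch's cache can look like against the Travis CI v3 API. This is not a record of the steps actually taken above; the token env var, the URL-encoded repo slug, and the `branch` filter on DELETE are my assumptions.)

```python
import os
import requests  # assumes the requests library is installed

# Hypothetical sketch only: delete cached shard state for the master branch via
# Travis CI's v3 API. If the `branch` filter is not honored on DELETE, this
# would drop all caches for the repo, so double-check before running.
token = os.environ["TRAVIS_TOKEN"]
resp = requests.delete(
    "https://api.travis-ci.org/repo/pantsbuild%2Fpants/caches",
    headers={
        "Travis-API-Version": "3",
        "Authorization": f"token {token}",
    },
    params={"branch": "master"},  # assumption: branch filtering is supported here
)
resp.raise_for_status()
print("cache delete returned:", resp.status_code)
```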
h
Yeah, this is borderline disastrous.
w
that cache issue is fairly different from the rest of the failures
currently, there are likely a fair number of failures due to timeouts post https://groups.google.com/d/topic/pants-devel/pN2fzqCIk-k/discussion
e
Agreed, but it's persistent, and even discovering it's different is problematic since we are so red. Which goes back to well before that email afaict.
w
well, the point of that email was to address a particular type of failure: the timeout
e
Agreed - no issues with the email - just the bleeding on master over a time period that spans from well before that to now.
Master is bankrupt.
w
there are other flaky tests, definitely. but i just want to point out that one type of potential failure changed modes (timeouts), and likely still needs everyone's help to tune
e
I think perhaps we should all agree never to merge on red. You must re-try shards until green to feel the pain each and every time. Perhaps that gets folks fixing flakes?
Even I don't believe my last sentence, but an otherwise healthy project looks really unhealthy.
w
in general, we are not merging on red.
e
OK. Well I'm not sure of a solution yet, but master shouldn't be this red by any sane measure.
w
looking at https://travis-ci.org/pantsbuild/pants/builds/568680753 , it's all timeouts, which means the email above is very highly relevant
(and some network flakes)
and the action item for timeouts right now is very clear
e
OK. Afaict no one has done anything with that email. We have been in a similar situation with flakes before, afaict: folks label them, but rarely does anyone dive into the issue and fix it.
There is no incentive that I can see to get master green. I think that's the root issue.
Clicking retry is too easy.
w
i wish i had an answer. i don't. for now, i'm going to do the thing that i optimistically asked other folks to take some time to do
(increase the timeouts for tests that are failing on master)
ftr: having audited the non-cron (separate but also important topic) failures for the last few runs, only 1 of the failures was due to a flaky test, and the rest were timeouts or network failures causing shards to fail to start
i will open a PR to increase those timeouts, and have bumped the one flaky test ticket.
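(For reference, the per-test timeout knob being discussed lives on the test targets themselves. A minimal sketch of what such a bump might look like in a BUILD file; the target name, sources glob, and the 600-second value are illustrative, not taken from the actual PR.)

```python
# Illustrative BUILD snippet only; names and the timeout value are made up.
python_tests(
  name='integration',
  sources=globs('*_integration.py'),
  # Per-target timeout in seconds: raise it for tests that legitimately run
  # long, rather than letting the whole shard die on the global limit.
  timeout=600,
)
```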
e
I'll have some data here soon and will craft an email to committers if it is as my sampling suggested. I appreciate the work you've done here with timeouts to clarify real failure reasons, but I truly do think the problem is well beyond this spate, and, like you, I have no good ideas except painting the picture more starkly with embarrassing data.
w
thanks.
based on which flaky-test tickets have been bumped recently, i think that a lot of them are orphaned/no-longer relevant.
e
w
thanks John... will look later today