# development
h
Ugh, I'm modifying the `init-pants` action to set up via scie-pants, and occasionally the setup just hangs on the final step of running `pants --version`, after it emits the version and appears to succeed. See, e.g., https://github.com/pantsbuild/setup/actions/runs/4126475639/jobs/7128451975
It should take 45 seconds, but it'll hang until the job gets canceled.
e
So the recent bug fix added explicit thread pool shutdown. IIRC it used a 250ms timeout, but that might be a place to double-check. Also, should that action be running `--no-pantsd`? Ideally we'd be able to use pantsd in CI without reservation, but I haven't tracked the status of that.
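For anyone not familiar with what "explicit thread pool shutdown with a timeout" looks like, here is a rough tokio-style sketch; it is not the actual Pants change, and the 2-worker runtime and 10s background task are purely illustrative:

```rust
use std::time::Duration;

fn main() {
    // A small multi-threaded runtime standing in for the engine's thread pool.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .expect("failed to build runtime");

    // Background work that may still be in flight at shutdown time.
    runtime.spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
    });

    // Wait at most 250ms for outstanding tasks, then drop the worker threads.
    // A shutdown path that instead blocks indefinitely on a stuck task is the
    // kind of thing that shows up as a hang after the useful output is printed.
    runtime.shutdown_timeout(Duration::from_millis(250));
}
```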
h
I figure it should use pantsd or not based on the configuration in pants.ci.toml? I wouldn't want to hard-code either way.
But yeah, the thread pool shutdown seems a possible culprit
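For reference, the pants.ci.toml knob in question would look roughly like this (a sketch; the actual file in the repo may set this differently or not at all):

```toml
# pants.ci.toml (illustrative sketch)
[GLOBAL]
# Disable the Pants daemon for CI runs; equivalent to passing --no-pantsd on the command line.
pantsd = false
```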
w
if you have an example repro PR though, i can take a look
h
It repros ~50% of the time in that job
I will SSH in and see if I can repro manually
w
k. back in about 45 minutes to take a look.
h
Naturally I can't repro via SSH
Notably, this only happens on linux, not macos
e
It does seem to be the case that Pants concurrency bugs always surface on Linux. By inference, I imagine Linux kernel locking / thread scheduling must be finer-grained than Mac / BSD, allowing more diverse orderings to happen more regularly.
h
Due to the pattern of when this was happening vs not, I thought it might be related to named-caches being retrieved vs not retrieved from the CI cache, but I can confirm this is not the case.
Going for a run, will dig more later
Can also confirm that sometimes rerunning the job in the exact same state makes it work and sometimes not. So this does seem like a race condition.
FWIW I cannot (at least so far) reproduce this with pantsd turned off
w
blargh. well… i’m also planning on disappearing for a few hours. i’ll take a look this evening, but anything you can do to narrow the repro would be really helpful
e
I just got 2 nice big hangs in CI immediately after upgrading a scie-pants IT to use 2.15.0rc4 to work around `FATAL: exception not rethrown`:
Linux: https://github.com/pantsbuild/scie-pants/actions/runs/4128694155/jobs/7133454406
Mac: https://github.com/pantsbuild/scie-pants/actions/runs/4128694155/jobs/7133454550
Not helpful except that this time it's a Mac too, and another case of `--pantsd`. I'm about to turn off pantsd, but I'll see if I can GH Action ssh debug after.
I have never seen hangs before, just the variants of `FATAL: exception not rethrown`; so my money is on the 2.15.0rc4 fix breaking other things here.
w
A backtrace from one of those would be worth its weight in gold.
I also noticed the other day that the SSH GitHub action can also be triggered only for canceled steps
e
That is definitely not true from my own past personal sshing
But I guess I don't know what you mean. It's its own step, so you need to get to that ssh step. Could be cancel, green, or a never-fail on a prior step to ignore its failure. AFAICT it's ~anything that gets you to that magic ssh step that allows ssh'ing.
w
Yeah, I didn't try the cancel one firsthand: I do know that the failure hook works. If a previous step fails, it will trigger the SSH handler.
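For reference, the gating being described looks roughly like this in a workflow; the tmate action is an assumption about which SSH action is in play, and the step names are illustrative:

```yaml
# Illustrative fragment of a job's steps; the real workflow differs.
- name: Run Pants
  run: pants --version

- name: SSH debug session
  # failure() fires when an earlier step failed; cancelled() fires when the run
  # was cancelled, e.g. after hanging until the job was killed.
  if: ${{ failure() || cancelled() }}
  uses: mxschmitt/action-tmate@v3
  with:
    limit-access-to-actor: true
```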
e
There has got to be a better way to do this. Debugging concurrency issues via CI ssh is pretty crappy. Maybe I can try some local hobbling exercises this weekend to see if I could have debugged this stuff locally.
h
Yeah, plus I haven't been able to reproduce any of this in SSH
w
Yea. If the cancel handler works, then it would allow for catching them when they happen naturally in CI. I'd love a better harness for local testing... if this was pure rust code, we could use loom for at least some cases.
But 9/10 times it's the GIL + some other lock.
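For anyone who hasn't used loom: it re-runs a closure under every permitted interleaving of the threads in the model, so orderings that only rarely show up in CI get exercised deterministically. A minimal sketch, with nothing Pants-specific about it:

```rust
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::sync::Arc;
use loom::thread;

#[test]
fn increments_are_not_lost() {
    // loom::model executes this closure once per allowed thread interleaving.
    loom::model(|| {
        let counter = Arc::new(AtomicUsize::new(0));

        let handles: Vec<_> = (0..2)
            .map(|_| {
                let counter = Arc::clone(&counter);
                thread::spawn(move || {
                    counter.fetch_add(1, Ordering::SeqCst);
                })
            })
            .collect();

        for handle in handles {
            handle.join().unwrap();
        }

        assert_eq!(counter.load(Ordering::SeqCst), 2);
    });
}
```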
e
It probably needs 2 days of code reading vs a debugger. Sometimes just understanding is way better. Right now we lean on you too much for that.
If you have to catch it with a tool, in my book that probably means you'll just get trapped again, since you don't really fully grok it.
Ok, got a repro hang and was able to attach gdb and get full thread logs. I'm having a look now and will post shortly.
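For anyone wanting to do the same capture, the attach-and-dump is roughly the following; the pgrep pattern is illustrative, so match whatever process is actually stuck:

```
# Attach to the hung process, then dump a backtrace for every thread.
gdb -p "$(pgrep -f pantsd | head -n 1)"
(gdb) set pagination off
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
```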
And I added a crappy analysis. I have no idea really what's going on with flows here. My head does not yet contain the model. I'll step back from this investigation. Let me know if you need anything more.
w