# development
b
I'm looking at https://github.com/pantsbuild/pants/issues/15771 (Fatal error with SIGINT):
```
Fatal Python error: This thread state must be current when releasing
Python runtime state: finalizing (tstate=0x2a10660)
```
The gist, from the thread traces, is that the main Python thread has finalized while a thread still executing our rule code is running. Looks like a race condition: normally our rule threads finish before Python finalizes. đŸ§”
IIUC I think we'll need to make sure the main thread blocks until our rule threads finish
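As a minimal sketch (stdlib only, not Pants code) of the race: the background thread stands in for a rule thread, and joining it is the "main thread blocks" idea:
```python
import threading
import time

def rule_work() -> None:
    # Stands in for rule code that is still touching Python state.
    for _ in range(20):
        time.sleep(0.01)

t = threading.Thread(target=rule_work)
t.start()
# Without this join, the main thread can reach interpreter finalization
# while `rule_work` is mid-execution; with pyo3-owned threads, that is
# what produces the fatal error above.
t.join()
```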
👍 1
h
Very hand-wavey comment: that sounds reasonable to me
😂 1
b
Well, part of this post is me rubber-duck debugging this :)
So any insight is welcome. Stu pointed me to `src/rust/engine/src/session.rs` for exception handling
It kinda seems from https://docs.rs/tokio/0.2.20/tokio/runtime/struct.Runtime.html that the way to do this is to drop the runtime. It also looks like we use the runtime to run the signal handler 😂
w
yea, that seems likely. it sounds like it amounts to adding an explicit call to `scheduler_shutdown` somewhere (in LocalPantsRunner?) and then seeing whether it hangs
the assumption so far has been that the `Scheduler` is `Drop`d when all the python references to it have been dropped. which should happen as a natural consequence of the process exiting, afaik
(i.e. while tearing down, the python refcount of the `PyScheduler` type should go to zero)
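(as a hedged pure-Python analogy, since the real `Drop` happens in pyo3: when the refcount on the wrapper hits zero, the wrapped contents are freed, the way `__del__` fires here)
```python
import sys

class PySchedulerLike:
    """Stand-in for a pyo3-wrapped type; not the real PyScheduler."""
    def __del__(self) -> None:
        print("refcount hit zero: Rust `Drop` would run here")

s = PySchedulerLike()
print(sys.getrefcount(s))  # 2: `s` itself plus getrefcount's temporary
del s  # prints immediately, assuming no other references or cycles
```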
b
I'm missing where we're shutting down the runtime by dropping it.
w
@bitter-ability-32190: `Drop` is implicit in rust: it happens when something goes out of scope. so if the `PyScheduler` (which is refcounted by Python) is gc’d, its contents are `Drop`d
but yes: i think you’re on the right track re: making sure that the main thread waits for the scheduler to exit.
b
```rust
static ref GLOBAL_EXECUTOR: ArcSwapOption<Runtime> = ArcSwapOption::from_pointee(None);
```
Is the executor not global?
w
i don’t think that you necessarily have to wait for the `Executor` to shut down, because if it doesn’t interact with Python, it should be harmless
b
It holds the threads executing the rule code, though, right?
w
the executor is, yes. but the Scheduler is what is actually feeding it Python code to run. i suspect that just ensuring the Scheduler is torn down is sufficient.
but yea, you could be right that ensuring that the Executor is idle is necessary too.
would check first whether the Scheduler is being torn down; if it is, that would point to the Executor also needing some sort of teardown
✅ 1
the ~only place where the Scheduler gives the Executor async* work is in `Graph::get`, which uses `tokio::spawn` to put async work on the Executor
b
ah waiting for idle đŸ€”
ok that all makes a lot more sense (I think)
Something's iffy: I see the `Scheduler` drop happening soon after I start my command, and not any time after Ctrl+C (with `--no-pantsd`)
w
a bootstrap scheduler is created, so you might be seeing that being dropped
not seeing it after the Ctrl+C though would definitely be something to look into
basically: is the LocalPantsRunner continuing through to completion? and if so, why isn’t the PyScheduler ending up dropped
👀 1
b
I don't think `Scheduler.shutdown` is ever called anywhere? Therefore we aren't calling `scheduler_shutdown`
w
iirc, waaaay back in the day (pre pyo3), it used to be added as an explicit hook in the python integration library we were using.
having an explicit call shouldn’t be necessary unless the `PyScheduler` isn’t actually getting gc’d (and thus `Drop`d).
but i suppose that i don’t really know what guarantees cpython makes about things being GC’d before a process exits.
b
I seem to remember Python doesn't make a whole lot of guarantees, especially if there's cycles, etc...
w
k. so maybe the answer is that we’ll need to be more explicit with shutdown, although that is an annoying game of whack a mole.
as a strawman: nuking the scheduler at the end of LocalPantsRunner

oh. but only when not running in `pantsd` (which passes in a live scheduler which is used across runs)
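roughly, as a strawman sketch (the attribute chain is what we’d call; the helper and flag names here are hypothetical):
```python
class LocalPantsRunner:  # sketch only: hypothetical structure
    def run(self) -> int:
        try:
            return self._run_goals()  # hypothetical helper
        finally:
            # Only tear down a scheduler we own: pantsd passes in a live
            # scheduler that is reused across runs.
            if not self._is_pantsd_run:  # hypothetical "do we own it?" flag
                self.graph_session.scheduler_session.scheduler.shutdown()
```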
b
So just for my understanding, you're saying we expect the GC to delete and free the `PyScheduler` object, which owns the `Scheduler`, which should then be `drop`ped?
w
correct
b
If I were more proficient at `gdb`, we could likely inspect the reference count for the `PyScheduler` in this scenario
(Also FWIW the behavior I'm seeing doesn't repro with pantsd, presumably because the process lives on)
So the `local_pants_runner` strawman might "just be it"
w
> So the `local_pants_runner` strawman might “just be it”
yea. see above though
semi-misleadingly, local pants runner is still used with `pantsd`: just with different constructor arguments. so it’s borrowing the scheduler in that case and shouldn’t kill it
b
đŸ˜”
Well "^C150123.02 [WARN] During shutdown: Some Sessions did not shutdown within 60s."
Using
self.graph_session.scheduler_session.scheduler.shutdown()
in the
finally
block
w
there should be a “Waiting for shutdown of X” message above that
one of which might be fishy
b
```
15:00:23.01 [INFO] Waiting for shutdown of: ["pants_run_2022_07_06_15_00_14_671_07ddc7afbc9548ed81f58f10a10a88a2", "streaming_workunit_handler_session"]
15:00:23.01 [INFO] Shutdown completed: "pants_run_2022_07_06_15_00_14_671_07ddc7afbc9548ed81f58f10a10a88a2"
```
w
yea, the `finally` is inside of the `with` block for `streaming_workunit_handler_session`. would want to teardown outside of that `with` block
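something shaped like this, as a sketch with stand-in names:
```python
from contextlib import contextmanager

@contextmanager
def workunit_handler_session():  # stand-in for StreamingWorkunitHandler
    yield
    print("handler session closed")  # the real __exit__ waits on its session

def run_goals() -> None: ...

def scheduler_shutdown() -> None:
    print("shutdown can now complete")

with workunit_handler_session():
    run_goals()
    # A `finally` in here runs *inside* the handler's session, so shutdown
    # would be waiting on that very session.
scheduler_shutdown()  # teardown after the `with` block instead
```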
b
ahh yeah
Hmm, still seeing the behavior when this is called outside/after the `StreamingWorkunitHandler`'s `__exit__`
w
...oh. probably because it ends up waiting for its own session to be dropped
🐍 1
i.e. LocalPantsRunner.graph_session
That shutdown method is used in `pantsd` to wait for clients to be disconnected before restarting
b
You lost me
Oh, shutdown itself is run using the executor, so it can't very well wait for the executor to stop executing?
w
no: i mean that how a Session “goes away” is by being GC’d: and the LocalPantsRunner is holding a Session (in its `self.graph_session` field)
b
(Probably an aside, but it seems the local pants runner does use `src/python/pants/base/exception_sink.py`)
And I can confirm it is the one handling `SIGINT` and raising the `KeyboardInterrupt`. But I think this info doesn't help us here
w
yea, i don’t think that the Executor is relevant until/unless the Scheduler is successfully shut down (and it still repros)
b
```python
print(sys.getrefcount(self))
print(sys.getrefcount(self.graph_session))
print(sys.getrefcount(self.graph_session.scheduler_session))
print(sys.getrefcount(self.graph_session.scheduler_session.scheduler))
print(sys.getrefcount(self.graph_session.scheduler_session.scheduler.py_scheduler))
```
gives me:
```
5
2
4
3
2
```
Subtract 1 from each of those, for the temporaries
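(The extra reference is the temporary binding `sys.getrefcount` makes for its own argument:)
```python
import sys

x = object()
# Reports 2: the `x` binding plus the temporary created for the call
# itself, hence "subtract 1" from the numbers above.
print(sys.getrefcount(x))
```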
Aha! We have progress!
Having the `Scheduler` (in Python) do `del self._py_scheduler` does then `drop` the `Scheduler` (in Rust). But it also reproduces the error
Oh, but we still need to shutdown, huh? 🙈 Sorry, this is all a bit confusing. I need a graphic
Yeah, the core shutdown does extra async work with `Work*`
w
i think that it is only interacting with Python that should be problematic
? i.e., trying to acquire the gil?
but you probably know better based on what you’ve seen here
is the issue that a thread has ever touched python and is still running, or only that it tries to touch python after `main` has exited
and in that vein: is there a way to disable finalization to just not worry about this?
b
Well I think we need the blocking. What's happening is the main thread is done, but ours is still trying to do py work
Oh but I guess when the os destroys the process everything gets dropped, and therefore our code gets a chance to complete?
w
i am similarly (to Python shutdown semantics) not familiar with the exact guarantees of `Drop` when a process exits. it almost certainly depends “how” the process exits, i.e. `sys.exit` vs returning from the entrypoint
b
That's a good point I was musing on as well. The fact that Python bothers to call finalize even with `sys.exit` makes me think exit-by-exception isn't special. I'll poke at the CPython code though
But then I also don't see this on normal process completion
I think that's the big breadcrumb here
Oh yeah, looks like `sys.exit` is it. Oddly, the code kinda looks like it's calling `PyFinalize` in that codepath and not the other
https://github.com/python/cpython/blob/8a0d9a6bb77a72cd8b9ece01b7c1163fff28029a/Python/pythonrun.c#L488 calls `PyErr_Print` if the return value is `NULL`. I suspect that is in the case of an exception (`sys.exit` just raises `SystemExit`) ... eventually in `handle_system_exit` we call `Py_Exit`, which calls `Py_FinalizeEx`: https://github.com/python/cpython/blob/bec802dbb87717a23acb9c600c15f40bc98340a3/Python/pylifecycle.c#L2922
And we have a winner! `os._exit(...)`
Although tracing through the code, it should be calling `finalize` in the happy path too 😕 So not sure why we don't see this otherwise, other than "race condition"
Ah, nevermind: it's a hard exit, and therefore the rule code just gets stopped. In the case of the issue, that means we don't clean up our mess
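A small illustration of the two exit paths traced above (not Pants code): `sys.exit` raises `SystemExit` and ends in `Py_FinalizeEx`, while `os._exit` is a hard stop that skips finalization entirely:
```python
import atexit
import os
import sys

atexit.register(lambda: print("finalizing"))  # runs during Py_FinalizeEx

if "--hard" in sys.argv:
    os._exit(0)  # hard exit: threads just stop, "finalizing" never prints
sys.exit(0)      # SystemExit -> handle_system_exit -> Py_Exit -> Py_FinalizeEx
```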
OK, so long-story-short @witty-crayon-22786 I don't think we can escape finalizing. Although no idea why this doesn't repro with normal exiting, and why it does repro with Ctrl+C being swallowed (`sys.exit` isn't raised)
Yeah, I suspect `SIGINT` is the key here.
Oh, here's a thought: should the session cancel be async? Or something in this ballpark of allowing the current rules to finish before returning execution
w
Session cancellation is async, yea
it’s essentially a signal that things like InteractiveProcess watch out for while they are running. when they receive it, they’re supposed to exit cleanly.
b
Well, `await`ing the cancellation wasn't the secret sauce either.
I set up a meeting since I can't seem to grasp all the moving parts
w
having said that, the `PySessionCancellationLatch` which is used to cancel `Sessions` in general may not be what gets triggered when a SIGINT comes in
nope, it is.
this is the bit where we’re running an interactive process. when the Session is canceled by the SIGINT, we should cleanly exit there: https://github.com/pantsbuild/pants/blob/900fe9287e76e963cac1844453792746053cb83d/src/rust/engine/src/intrinsics.rs#L676-L696
> I set up a meeting
yea, sounds good. see you there
b
We do cleanly exit there. The InteractiveProcess receives SIGINT and exits fine. The issue is that the `async` rule code which was running the process is attempting to clean up and taking a very long time
w
got it. yea, that’s where waiting for the Executor might be necessary.
basically, all of the work which is running in the `Graph` is `spawn`d onto the Executor
 i.e., it’s running async in the background. when clients go away, it is cancelled as soon as possible, but not synchronously. so it might take some time for things to wrap up.
(the rule code needs to hit its next `await` before it can be canceled)
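(in `asyncio` terms, as a rough sketch of that cooperative model:)
```python
import asyncio

async def rule_body() -> None:
    while True:
        await asyncio.sleep(0.1)  # each await is a cancellation point

async def main() -> None:
    task = asyncio.create_task(rule_body())
    await asyncio.sleep(0.25)
    task.cancel()  # only takes effect at the task's next await
    try:
        await task
    except asyncio.CancelledError:
        print("rule body cancelled cleanly")

asyncio.run(main())
```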
the Executor going completely idle may not happen until after the Scheduler/Core are dropped, because we spawn other work onto it that runs in loops until things have been dropped, etc.
b
✈ đŸ€Ż
waaaay over my head 😛
w
yea, if you can clarify what exactly it is we shouldn’t be doing, i can be thinking about the how not to do it aspect 😃
😅 1
b
We shouldn't be running any more python code after the main thread finalizes. I think what this means is we need to "shutdown" all the mechanics of running the rule threads, but wait for them to finish. We can place that anywhere.
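In stdlib terms, the shape I mean (a sketch, not our actual mechanics):
```python
from concurrent.futures import ThreadPoolExecutor
import time

def rule_task() -> None:
    time.sleep(0.1)  # stands in for in-flight rule code

pool = ThreadPoolExecutor(max_workers=4)
for _ in range(8):
    pool.submit(rule_task)
# Stop accepting new work and block until the worker threads drain,
# before the main thread returns and CPython finalization can begin.
pool.shutdown(wait=True)
```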
w
yep. then i think that we will need to shutdown the Scheduler, and then also shutdown the Executor.
having said that, this does seem like exactly the situation in which https://pyo3.rs/v0.5.1/doc/pyo3/fn.prepare_freethreaded_python.html is relevant. i think that the issue in this case is that Python is starting first, rather than Rust
so it’s too late: Python will already have been initialized.
b
Yeah, that's a no-op if we call it, like you say. I think that's more for embedded Python apps
w
which we hope to become soon(ish): https://github.com/pantsbuild/pants/issues/7369
it’s John’s next project once PEX lockfiles are stable
b
Y'all planning on using `PyO3` or `PyOxy`? 👀
w
um, i think that to get the static interpreter you need PyOxidizer, since they have python interpreter builds ready for embedding
we’d still be using PyO3 though
b
Yeah, in my head it's `pyoxy` providing Python, but still using `pyo3` for the FFIing
w
maybe Python can be dynamically loaded without pyoxidizer. unsure whether that would be a useful intermediate step.
yep
b
I'm guessing the native client fits into this story as well?
w
somewhat, yea. the native client is mostly just about avoiding starting a python interpreter to connect to `pantsd`. that could be shipped first, since the `pants` script could get a hold of the native client binary and invoke that rather than `python`
at this point though, we think that the distribution model is more pressing. we want the lower latency, but introducing shenanigans to the `pants` script to get it increases complexity that we want to be removing instead
✅ 2
b
Oh dang, my `debugpy` shenanigans will need to be shifted. HMU when that needs to be done, I don't wanna bog John down with that