# development
b
I'm looking at https://github.com/pantsbuild/pants/issues/15771 (Fatal error with SIGINT):
```
Fatal Python error: This thread state must be current when releasing
Python runtime state: finalizing (tstate=0x2a10660)
```
The gist, from the thread traces, is that the main Python thread has finalized while a thread still executing our rule code is running. Looks like a race condition: normally our rule threads finish before Python finalizes. đŸ§”
IIUC I think we'll need to make sure the main thread blocks until our rule threads finish
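As a minimal sketch (stdlib only, not Pants code) of the race: the background thread stands in for a rule thread, and joining it is the "main thread blocks" idea:
```python
import threading
import time

def rule_work() -> None:
    # Stands in for rule code that is still touching Python state.
    for _ in range(20):
        time.sleep(0.01)

t = threading.Thread(target=rule_work)
t.start()
# Without this join, the main thread can reach interpreter finalization
# while `rule_work` is mid-execution; with pyo3-owned threads, that is
# what produces the fatal error above.
t.join()
```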
👍 1
h
Very hand-wavey comment: that sounds reasonable to me
😂 1
b
Well, part of this post is me rubber-duck debugging this :)
So any insight is welcome. Stu pointed me to `src/rust/engine/src/session.rs` for exception handling
It kinda seems from https://docs.rs/tokio/0.2.20/tokio/runtime/struct.Runtime.html that the way to do this is to drop the runtime. It also looks like we use the runtime to run the signal handler 😂
w
yea, that seems likely. it sounds like it amounts to adding an explicit call to `scheduler_shutdown` somewhere (in LocalPantsRunner?) and then seeing whether it hangs
the assumption so far has been that the `Scheduler` is `Drop`d when all the python references to it have been dropped. which should happen as a natural consequence of the process exiting, afaik
(i.e. while tearing down, the python refcount of the `PyScheduler` type should go to zero)
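(as a hedged pure-Python analogy, since the real `Drop` happens in pyo3: when the refcount on the wrapper hits zero, the wrapped contents are freed, the way `__del__` fires here)
```python
import sys

class PySchedulerLike:
    """Stand-in for a pyo3-wrapped type; not the real PyScheduler."""
    def __del__(self) -> None:
        print("refcount hit zero: Rust `Drop` would run here")

s = PySchedulerLike()
print(sys.getrefcount(s))  # 2: `s` itself plus getrefcount's temporary
del s  # prints immediately, assuming no other references or cycles
```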
b
I'm missing where we're shutting down the runtime by dropping it.
w
@bitter-ability-32190: `Drop` is implicit in rust: it happens when something goes out of scope. so if the `PyScheduler` (which is refcounted by Python) is gc’d, its contents are `Drop`d
but yes: i think you’re on the right track re: making sure that the main thread waits for the scheduler to exit.
b
```rust
static ref GLOBAL_EXECUTOR: ArcSwapOption<Runtime> = ArcSwapOption::from_pointee(None);
```
Is the executor not global?
w
i don’t think that you necessarily have to wait for the `Executor` to shut down, because if it doesn’t interact with Python, it should be harmless
b
It holds the threads executing the rule code, though, right?
w
the executor is, yes. but the Scheduler is what is actually feeding it Python code to run. i suspect that just ensuring the Scheduler is torn down is sufficient.
but yea, you could be right that ensuring that the Executor is idle is necessary too.
would check first whether the Scheduler is being torn down; if it is, that would point to the Executor also needing some sort of teardown
✅ 1
the ~only place where the Scheduler gives the Executor async* work is in `Graph::get`, which uses `tokio::spawn` to put async work on the Executor
b
ah waiting for idle đŸ€”
ok that all makes a lot more sense (I think)
Something's iffy: I see the `Scheduler` drop happening soon after I start my command, and not any time after Ctrl+C (with `--no-pantsd`)
w
a bootstrap scheduler is created, so you might be seeing that being dropped
not seeing it after the Ctrl+C though would definitely be something to look into
basically: is the LocalPantsRunner continuing through to completion? and if so, why isn’t the PyScheduler ending up dropped
👀 1
b
I don't think `Scheduler.shutdown` is ever called anywhere? Therefore we aren't calling `scheduler_shutdown`
w
iirc, waaaay back in the day (pre pyo3), it used to be added as an explicit hook in the python integration library we were using.
having an explicit call shouldn’t be necessary unless the `PyScheduler` isn’t actually getting gc’d (and thus `Drop`d).
but i suppose that i don’t really know what guarantees cpython makes about things being GC’d before a process exits.
b
I seem to remember Python doesn't make a whole lot of guarantees, especially if there's cycles, etc...
w
k. so maybe the answer is that we’ll need to be more explicit with shutdown, although that is an annoying game of whack a mole.
as a strawman: nuking the scheduler at the end of LocalPantsRunner

oh. but only when not running in `pantsd` (which passes in a live scheduler which is used across runs)
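roughly, as a strawman sketch (the attribute chain is what we’d call; the helper and flag names here are hypothetical):
```python
class LocalPantsRunner:  # sketch only: hypothetical structure
    def run(self) -> int:
        try:
            return self._run_goals()  # hypothetical helper
        finally:
            # Only tear down a scheduler we own: pantsd passes in a live
            # scheduler that is reused across runs.
            if not self._is_pantsd_run:  # hypothetical "do we own it?" flag
                self.graph_session.scheduler_session.scheduler.shutdown()
```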
b
So just for my understanding, you're saying we expect the GC to delete and free the `PyScheduler` object, which owns the `Scheduler`, which should then be `drop`ped?
w
correct
b
If I were more proficient at `gdb`, we could likely inspect the reference count for the `PyScheduler` in this scenario
(Also FWIW the behavior I'm seeing doesn't repro with pantsd, presumably because the process lives on)
So the `local_pants_runner` strawman might "just be it"
w
> So the `local_pants_runner` strawman might “just be it”
yea. see above though
semi-misleadingly, local pants runner is still used with `pantsd`: just with different constructor arguments. so it’s borrowing the scheduler in that case and shouldn’t kill it
b
đŸ˜”
Well "^C150123.02 [WARN] During shutdown: Some Sessions did not shutdown within 60s."
Using
self.graph_session.scheduler_session.scheduler.shutdown()
in the
finally
block
w
there should be a “Waiting for shutdown of X” message above that
one of which might be fishy
b
```
15:00:23.01 [INFO] Waiting for shutdown of: ["pants_run_2022_07_06_15_00_14_671_07ddc7afbc9548ed81f58f10a10a88a2", "streaming_workunit_handler_session"]
15:00:23.01 [INFO] Shutdown completed: "pants_run_2022_07_06_15_00_14_671_07ddc7afbc9548ed81f58f10a10a88a2"
```
w
yea, the `finally` is inside of the `with` block for `streaming_workunit_handler_session`. would want to teardown outside of that `with` block
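something shaped like this, as a sketch with stand-in names:
```python
from contextlib import contextmanager

@contextmanager
def workunit_handler_session():  # stand-in for StreamingWorkunitHandler
    yield
    print("handler session closed")  # the real __exit__ waits on its session

def run_goals() -> None: ...

def scheduler_shutdown() -> None:
    print("shutdown can now complete")

with workunit_handler_session():
    run_goals()
    # A `finally` in here runs *inside* the handler's session, so shutdown
    # would be waiting on that very session.
scheduler_shutdown()  # teardown after the `with` block instead
```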
b
ahh yeah
Hmm, still seeing the behavior when this is called outside/after the `StreamingWorkunitHandler`'s `__exit__`
w
...oh. probably because it ends up waiting for its own session to be dropped
🐍 1
i.e. LocalPantsRunner.graph_session
That shutdown method is used in `pantsd` to wait for clients to be disconnected before restarting
b
You lost me
Oh, shutdown itself is run using the executor, so it can't very well wait for the executor to stop executing?
w
no: i mean that how a Session “goes away” is by being GC’d: and the LocalPantsRunner is holding a Session (in its `self.graph_session` field)
b
(Probably an aside, but it seems the local pants runner does use `src/python/pants/base/exception_sink.py`)
And I can confirm it is the one handling `SIGINT` and raising the `KeyboardInterrupt`. But I think this info doesn't help us here
w
yea, i don’t think that the Executor is relevant until/unless the Scheduler is successfully shut down (and it still repros)
b
```python
print(sys.getrefcount(self))
print(sys.getrefcount(self.graph_session))
print(sys.getrefcount(self.graph_session.scheduler_session))
print(sys.getrefcount(self.graph_session.scheduler_session.scheduler))
print(sys.getrefcount(self.graph_session.scheduler_session.scheduler.py_scheduler))
```
gives me:
```
5
2
4
3
2
```
Subtract 1 from each of those, for the temporaries
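(The extra reference is the temporary binding `sys.getrefcount` makes for its own argument:)
```python
import sys

x = object()
# Reports 2: the `x` binding plus the temporary created for the call
# itself, hence "subtract 1" from the numbers above.
print(sys.getrefcount(x))
```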
Aha! We have progress!
Having the `Scheduler` (in Python) do `del self._py_scheduler` does then `drop` the `Scheduler` (in Rust). But it also reproduces the error
Oh, but we still need to shutdown, huh? 🙈 Sorry, this is all a bit confusing. I need a graphic
Yeah, the core shutdown does extra async work with `Work*`
w
i think that it is only interacting with Python that should be problematic
? i.e., trying to acquire the gil?
but you probably know better based on what you’ve seen here
is the issue that a thread has ever touched python and is still running, or only that it tries to touch python after `main` has exited
and in that vein: is there a way to disable finalization to just not worry about this?
b
Well I think we need the blocking. What's happening is the main thread is done, but ours is still trying to do py work
Oh but I guess when the os destroys the process everything gets dropped, and therefore our code gets a chance to complete?
w
i am similarly (to Python shutdown semantics) not familiar with the exact guarantees of `Drop` when a process exits. it almost certainly depends “how” the process exits, i.e. `sys.exit` vs returning from the entrypoint
b
That's a good point I was musing on as well. The fact that Python bothers to call finalize even with `sys.exit` makes me think exit-by-exception isn't special. I'll poke at the CPython code though
But then I also don't see this on normal process completion
I think that's the big breadcrumb here
Oh yeah, looks like `sys.exit` is it. Oddly, the code kinda looks like it's calling `PyFinalize` in that codepath and not the other
https://github.com/python/cpython/blob/8a0d9a6bb77a72cd8b9ece01b7c1163fff28029a/Python/pythonrun.c#L488 calls `PyErr_Print` if the return value is `NULL`. I suspect that is in the case of an exception (`sys.exit` just raises `SystemExit`) ... eventually in `handle_system_exit` we call `Py_Exit`, which calls `Py_FinalizeEx`: https://github.com/python/cpython/blob/bec802dbb87717a23acb9c600c15f40bc98340a3/Python/pylifecycle.c#L2922
And we have a winner! `os._exit(...)`
Although tracing through the code, it should be calling `finalize` in the happy path too 😕 So not sure why we don't see this otherwise, other than "race condition"
Ah, nevermind: it's a hard exit, and therefore the rule code just gets stopped. In the case of the issue, that means we don't clean up our mess
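A small illustration of the two exit paths traced above (not Pants code): `sys.exit` raises `SystemExit` and ends in `Py_FinalizeEx`, while `os._exit` is a hard stop that skips finalization entirely:
```python
import atexit
import os
import sys

atexit.register(lambda: print("finalizing"))  # runs during Py_FinalizeEx

if "--hard" in sys.argv:
    os._exit(0)  # hard exit: threads just stop, "finalizing" never prints
sys.exit(0)      # SystemExit -> handle_system_exit -> Py_Exit -> Py_FinalizeEx
```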
OK, so long-story-short @witty-crayon-22786 I don't think we can escape finalizing. Although no idea why this doesn't repro with normal exiting, and why it does repro with Ctrl+C being swallowed (`sys.exit` isn't raised)
Yeah, I suspect `SIGINT` is the key here.
Oh, here's a thought: should the session cancel be async? Or something in this ballpark of allowing the current rules to finish before returning execution
w
Session cancellation is async, yea
it’s essentially a signal that things like InteractiveProcess watch out for while they are running. when they receive it, they’re supposed to exit cleanly.
b
Well, `await`ing the cancellation wasn't the secret sauce either.
I set up a meeting since I can't seem to grasp all the moving parts
w
having said that, the `PySessionCancellationLatch` which is used to cancel `Sessions` in general may not be what gets triggered when a SIGINT comes in
nope, it is.
this is the bit where we’re running an interactive process. when the Session is canceled by the SIGINT, we should cleanly exit there: https://github.com/pantsbuild/pants/blob/900fe9287e76e963cac1844453792746053cb83d/src/rust/engine/src/intrinsics.rs#L676-L696
> I set up a meeting
yea, sounds good. see you there
b
We do cleanly exit there. The InteractiveProcess receives SIGINT and exits fine. The issue is that the `async` rule code which was running the process is attempting to clean up and taking a very long time
w
got it. yea, that’s where waiting for the Executor might be necessary.
basically, all of the work which is running in the `Graph` is `spawn`d onto the Executor
 i.e., it’s running async in the background. when clients go away, it is cancelled as soon as possible, but not synchronously. so it might take some time for things to wrap up.
(the rule code needs to hit its next `await` before it can be canceled)
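(in `asyncio` terms, as a rough sketch of that cooperative model:)
```python
import asyncio

async def rule_body() -> None:
    while True:
        await asyncio.sleep(0.1)  # each await is a cancellation point

async def main() -> None:
    task = asyncio.create_task(rule_body())
    await asyncio.sleep(0.25)
    task.cancel()  # only takes effect at the task's next await
    try:
        await task
    except asyncio.CancelledError:
        print("rule body cancelled cleanly")

asyncio.run(main())
```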
the Executor going completely idle may not happen until after the Scheduler/Core are dropped, because we spawn other work onto it that runs in loops until things have been dropped, etc.
b
✈ đŸ€Ż
waaaay over my head 😛
w
yea, if you can clarify what exactly it is we shouldn’t be doing, i can be thinking about the how not to do it aspect 😃
😅 1
b
We shouldn't be running any more python code after the main thread finalizes. I think what this means is we need to "shutdown" all the mechanics of running the rule threads, but wait for them to finish. We can place that anywhere.
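In stdlib terms, the shape I mean (a sketch, not our actual mechanics):
```python
from concurrent.futures import ThreadPoolExecutor
import time

def rule_task() -> None:
    time.sleep(0.1)  # stands in for in-flight rule code

pool = ThreadPoolExecutor(max_workers=4)
for _ in range(8):
    pool.submit(rule_task)
# Stop accepting new work and block until the worker threads drain,
# before the main thread returns and CPython finalization can begin.
pool.shutdown(wait=True)
```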
w
yep. then i think that we will need to shutdown the Scheduler, and then also shutdown the Executor.
having said that, this does seem like exactly the situation in which https://pyo3.rs/v0.5.1/doc/pyo3/fn.prepare_freethreaded_python.html is relevant. i think that the issue in this case is that Python is starting first, rather than Rust
so it’s too late: Python will already have been initialized.
b
Yeah, that's a no-op if we call it, like you say. I think that's more for embedded Python apps
w
which we hope to become soon(ish): https://github.com/pantsbuild/pants/issues/7369
it’s John’s next project once PEX lockfiles are stable
b
Y'all planning on using `PyO3` or `PyOxy`? 👀
w
um, i think that to get the static interpreter you need PyOxidizer, since they have python interpreter builds ready for embedding
we’d still be using PyO3 though
b
Yeah, in my head it's `pyoxy` providing Python, but still using `pyo3` for the FFIing
w
maybe Python can be dynamically loaded without pyoxidizer. unsure whether that would be a useful intermediate step.
yep
b
I'm guessing the native client fits into this story as well?
w
somewhat, yea. the native client is mostly just about avoiding starting a python interpreter to connect to `pantsd`. that could be shipped first, since the `pants` script could get a hold of the native client binary and invoke that rather than `python`
at this point though, we think that the distribution model is more pressing. we want the lower latency, but introducing shenanigans to the `pants` script to get it increases complexity that we want to be removing instead
✅ 2
b
Oh dang, my `debugpy` shenanigans will need to be shifted. HMU when that needs to be done, I don't wanna bog John down with that