# development
e
Re: https://github.com/pantsbuild/pants/issues/13899 Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.
Instead of adding my second +1 account to the issue thread, I'd rather do something useful here.
@fast-nail-55400 & @witty-crayon-22786 I think you two have the context here.
w
no objections: i just haven’t triaged it yet to be able to confirm that that would fix it
lemme see
e
It's a pretty tiny change until you get to all of the Processes that need to include a new bit in their normal input digest. That ripple is still simple, but large. So if you say go, I'll do it. There is no way it can't fix it, since the paths no longer overlap and normal digests are used, which work everywhere else.
ETXTBSY comes back as a possibility though.
I think the frequency of breakage there was much lower though, which both CI / Pants dev consumers and customers would appreciate.
w
Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.
it’s not clear to me that this would resolve it, since they’re currently part of the same digest, which means they should be materialized together in one call to
materialize_directory
e
My proposal splits them into a separate digest at a separate path.
w
the fact that this is always in
setup_jdk
seems relevant… it’s something that that
@rule
invokes that is causing the trouble.
e
Ok, fine, I'll dig deeper. My fix works though.
I don't think you actually read my proposal well.
w
i think i don’t understand your hypothesis then… when you’re materializing a single digest, you create and populate each directory exactly once recursively
so three files in one directory as part of a single digest should be fine
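roughly, a toy version of what a single materialization does (made-up types here, not the real engine code) - each directory is created once, and all of its files are written in that same visit:
```rust
use std::fs;
use std::io;
use std::path::Path;

// Toy stand-in for a directory digest: files plus named subdirectories.
struct DirNode {
    files: Vec<(String, Vec<u8>)>,
    dirs: Vec<(String, DirNode)>,
}

// Materialize the tree rooted at `dest`. Each directory is created exactly
// once, and all of its files are written during that same visit, so sibling
// files (like the three under __coursier/bin) can't trip over each other
// within a single materialization.
fn materialize(dest: &Path, node: &DirNode) -> io::Result<()> {
    fs::create_dir_all(dest)?;
    for (name, contents) in &node.files {
        fs::write(dest.join(name), contents)?;
    }
    for (name, child) in &node.dirs {
        materialize(&dest.join(name), child)?;
    }
    Ok(())
}
```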
e
I directly state I split them
w
i know: i don’t understand how that would prevent the bug
e
There are now 2 digests, an immutable one for coursier and a normal one for the 2 scripts
the normal one gets merged into a normal process input digest.
and it's at a different path
I used __coursier/bin for immutable and __coursier/scripts for the scripts.
IOW the normal digest has the scripts so they never get the immutable r/o treatment.
w
afaict, what appears to be happening is that something else is being written as immutable at the relevant location, such that when we try to materialize Coursier’s digest (containing the three files) at
__coursier/*
, it fails. it’s not clear to me that moving those two files out won’t expose this same thing happening to the coursier binary (since as mentioned above: they’re all part of one digest, and one call to
materialize_directory
)
e
So have I read things right in that for 1 scheduler we have 1 ImmutableInputs which has 1 fixed tmp dir?
w
i can’t figure out from inspection why those files might be materialized more than once though, so … i don’t have any better theories. if it’s not a huge change, then we can try it, but it sounds non-trivial
e
If its actions aren't memoized, which it doesn't seem they are, you write the same paths twice to the tmp dir
w
If its actions aren't memoized, which it doesn't seem they are, you write the same paths twice to the tmp dir
they are memoized by the
ImmutableInputs
struct
…but maybe i screwed up the locking in there… that would do it.
e
But the lock is on the digest, not paths
Couldn't two digests intersect paths?
That would go boom
w
no, they shouldn’t. because they share zero content: the digest is the root directory, and never overlaps other directories
no deduping of internal digests
e
If I have a digest that maps to foo/bar + foo/baz and another to just foo/bar aren't those two separate locks?
But overlapping tmp dir paths?
w
no:
$tmp/${digest1}/foo/*
and
$tmp/${digest2}/foo/*
e
Ah, missed the digest str append
w
er: two separate locks, and two separate file paths
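i.e. something shaped roughly like this (hypothetical names, not the actual ImmutableInputs code) - the root is keyed by the digest's fingerprint, so overlapping relative paths in two digests never collide:
```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Mutex;

// Hypothetical sketch: digests are identified by a fingerprint string, and
// each one gets its own root directory under the single tmp dir.
struct ImmutableInputsSketch {
    base: PathBuf,                         // one fixed tmp dir per scheduler
    memo: Mutex<HashMap<String, PathBuf>>, // one entry (and one write) per digest
}

impl ImmutableInputsSketch {
    fn root_for(&self, fingerprint: &str) -> PathBuf {
        let mut memo = self.memo.lock().unwrap();
        memo.entry(fingerprint.to_owned())
            // First caller for this digest materializes it; later callers reuse it.
            .or_insert_with(|| self.base.join(fingerprint))
            .clone()
    }

    // Relative paths resolve under the digest's own root, so
    // `$tmp/<digest1>/foo/bar` and `$tmp/<digest2>/foo/bar` are distinct files
    // even though both digests contain `foo/bar`.
    fn path_in(&self, fingerprint: &str, relpath: &Path) -> PathBuf {
        self.root_for(fingerprint).join(relpath)
    }
}
```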
e
Ok
Well, at any rate, the main point is this is bleeding in CI. We used to be bad about this. We got a whole lot better. It would be great not to slip again. I know there are non-OSS pressures though.
w
yea, i’m fine with trying it, since i don’t have any better hypotheses right now. i just can’t say that i understand why it would work
e
I think it's unquestionable it will work; to your point, the problem will kick down to the coursier binary.
w
one way that would fail would be if
ImmutableInputs
fails to materialize a directory: we don’t currently clean up after a failed attempt, and so the next attempt to materialize will encounter a readonly layout. but i don’t see a failed attempt to materialize here
e
Yeah - that's definitely a failure mode to fix whether or not it's what is happening here.
This looks buggy to me: https://github.com/chrislearn/double-checked-cell-async/blob/master/src/lib.rs#L232 I think that needs to be Ordering::Acquire. It appears once_init (which the sync version of this is deprecated in favor of) does use Acquire for its second check. I'm not super positive on my analysis yet though.
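For reference, the conservative shape I'd expect (a toy cell for illustration, not that crate's actual code): Release on the publish, Acquire on both checks.
```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Toy double-checked cell: the initializer publishes with Release, and *both*
// loads of the flag use Acquire, so a thread that observes `true` also
// observes the writes that initialized the value.
pub struct DoubleCheckedCell<T> {
    initialized: AtomicBool,
    lock: Mutex<()>,
    value: UnsafeCell<MaybeUninit<T>>,
}

unsafe impl<T: Send + Sync> Sync for DoubleCheckedCell<T> {}

impl<T> DoubleCheckedCell<T> {
    pub fn new() -> Self {
        DoubleCheckedCell {
            initialized: AtomicBool::new(false),
            lock: Mutex::new(()),
            value: UnsafeCell::new(MaybeUninit::uninit()),
        }
    }

    pub fn get_or_init(&self, init: impl FnOnce() -> T) -> &T {
        // Fast path: Acquire pairs with the Release store below.
        if !self.initialized.load(Ordering::Acquire) {
            let _guard = self.lock.lock().unwrap();
            // Second check under the lock: keeping it Acquire is the
            // conservative choice; a Relaxed load here is the suspect pattern.
            if !self.initialized.load(Ordering::Acquire) {
                unsafe { (*self.value.get()).write(init()) };
                self.initialized.store(true, Ordering::Release);
            }
        }
        // Safe: `initialized` is only ever set after the value is written.
        // (A real implementation would also need Drop to free the value.)
        unsafe { (*self.value.get()).assume_init_ref() }
    }
}
```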
👀 1
w
mm. the age of that crate always seemed like a liability. but yea, that would certainly do it.
e
The thing is, I thought all this went out the window on Intel - I thought it had stricter semantics for atomics regardless. Very, very handwavy on my part at this point.
That said - that crate, though new / active, begs off production use.
How can double-checked locking bugs still be a thing!
w
sigh.
e
👍 1
w
i’ve only ever used orderings as conservatively as possible, because i don’t understand them.
the
tokio
folks have a library/framework to check all interleavings though: https://tokio.rs/blog/2019-10-scheduler#fearless-unsafe-concurrency-with-loom
e
Yeah, I'm pretty shaky there too. Maybe it's not too hard to flag in a switch between those two cell impls, and I can flag the new one on in CI for a while? Or maybe there is just a simpler / slightly less efficient way to structure our code.
w
another option would be to add an assert that the destination doesn't already exist, to validate the hypothesis? incompatible with "doing cleanup if we fail to materialize", though. given the "failed is valid, but needs cleanup" case, it seems like maybe rather than a cell, this *should* be an async Mutex with a state machine inside
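purely as a hypothetical sketch (invented names), something like - the state remembers a failed attempt, so the next caller cleans up the readonly leftovers before retrying:
```rust
use std::path::PathBuf;
use tokio::sync::Mutex;

// Hypothetical sketch of the "async Mutex with a state machine inside" idea.
enum MaterializeState {
    NotStarted,
    Failed,
    Done(PathBuf),
}

struct ImmutableDir {
    dest: PathBuf,
    state: Mutex<MaterializeState>,
}

impl ImmutableDir {
    async fn get_or_materialize(
        &self,
        materialize: impl Fn(&PathBuf) -> std::io::Result<()>,
    ) -> std::io::Result<PathBuf> {
        let mut state = self.state.lock().await;
        loop {
            match &*state {
                MaterializeState::Done(path) => return Ok(path.clone()),
                MaterializeState::Failed => {
                    // A previous attempt left a (possibly readonly) partial
                    // layout: remove it before retrying rather than tripping
                    // over it.
                    let _ = std::fs::remove_dir_all(&self.dest);
                    *state = MaterializeState::NotStarted;
                }
                MaterializeState::NotStarted => match materialize(&self.dest) {
                    Ok(()) => *state = MaterializeState::Done(self.dest.clone()),
                    Err(e) => {
                        *state = MaterializeState::Failed;
                        return Err(e);
                    }
                },
            }
        }
    }
}
```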
e
Yeah, I need to break here for a few hours but I'll return to this more holistically this evening. I'd like to squash this dead and move on.
w
ok, thanks a lot. i’m tied up this evening, but will take a look if i can later, or tomorrow if not
e
Loom is great. Unfortunately it explicitly does not / cannot handle Ordering::Relaxed. I rigged up a quick loom test of the dcl code in parking_lot that fails correctly if the double-check writes with Ordering::Acquire, for example - but yeah, unfortunately not a tool that can be used for this case.
I'll try just using normal code and hammering it to see if I can trigger a broken dcl event.
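Roughly this shape (std's OnceLock as a stand-in target, just to show the structure - the real run would swap in the suspect cell impl):
```rust
use std::sync::{Arc, Barrier, OnceLock};
use std::thread;

// Hammer test sketch: many threads race to initialize, then all assert they
// observe the same fully-initialized value. A broken double-check could hand
// back a value whose writes aren't visible yet.
fn main() {
    const THREADS: usize = 16;
    for _ in 0..1_000 {
        let cell: Arc<OnceLock<Vec<u64>>> = Arc::new(OnceLock::new());
        let barrier = Arc::new(Barrier::new(THREADS));
        let handles: Vec<_> = (0..THREADS)
            .map(|_| {
                let (cell, barrier) = (cell.clone(), barrier.clone());
                thread::spawn(move || {
                    barrier.wait(); // line everyone up to maximize contention
                    let v = cell.get_or_init(|| (0..1024u64).collect());
                    assert_eq!(v.len(), 1024);
                    assert_eq!(v[1023], 1023);
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }
}
```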
Actually - although I am curious to see if the dcl is broken - is a Pex / atomic_directory style lock and population of a permanent ~/.cache/pants/immutable_caches dir desirable? Right now we re-do unpacking from LMDB on every pantsd restart. The good side is auto-gc of course and no waiting on the cache goal ticket. The bad side is redoing this every time and ... potentially dcl bugs.
Tom's initial issue was effectively not having an atomic_directory construct / API in the rules.
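Roughly the pattern I mean (sketch only, invented helper name - not the actual Pex API; the real thing would also hold an exclusive file lock):
```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// atomic_directory-style helper: populate a scratch directory next to the
// destination, then atomically rename it into place. A crash mid-population
// leaves only the scratch dir behind, never a half-built (readonly) `dest`.
// A real version would also take an exclusive file lock so concurrent
// processes don't both do the work.
fn atomic_directory(
    dest: &Path,
    populate: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<PathBuf> {
    if dest.exists() {
        return Ok(dest.to_path_buf());
    }
    let work = PathBuf::from(format!("{}.work", dest.display()));
    let _ = fs::remove_dir_all(&work); // discard any earlier failed attempt
    fs::create_dir_all(&work)?;
    populate(&work)?;
    match fs::rename(&work, dest) {
        Ok(()) => Ok(dest.to_path_buf()),
        // If another process won the race, its fully-populated dir is equivalent.
        Err(_) if dest.exists() => {
            let _ = fs::remove_dir_all(&work);
            Ok(dest.to_path_buf())
        }
        Err(e) => Err(e),
    }
}
```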
w
i’ll get out a sanity check to confirm the hypothesis shortly, unless you already have something pending
e
Let's assume that confirms: do you have an opinion on fixing the dcl vs switching to an atomic directory (which kills 2 birds re population-failure cleanup)? Further, do you have an opinion on the location of an atomic dir if you're ok with that idea - ephemeral like today vs in ~/.cache/pants?
w
I would rather not add something that needs cleanup without also adding the cleanup. The other reason we left them ephemeral was to avoid needing to worry that the digest had been mutated and gone out of sync run-over-run.
But would we really want to rely on atomic renames to prevent N threads in one process from materializing atop one another? You'd still want (file) locking, I presume
e
Yes, you'd want file locking. That's how it's done over in Pex for most cases (bool flag).
I buy the cleanup bit. The digest-mutated bit sounds overly paranoid. I'm probably not seeing some case you are? I'd think if our digesting was broken then way more would be broken too.
w
Less of a concern since we've bothered to make the directory structure readonly I suppose. Code would have to override that (although some definitely does)
e
Ok, so ... this is getting painful. Go or no go? I'm stepping in to fix the bleeding, but if you have opinions on how it should be done I can step back.
w
Do whatever you want. I always "have an opinion", but you know what you're about.
e
I stuck with the userspace in-mem locking since the interprocess locking only makes sense if/when we go to storing this stuff in a stable
~/.cache/pants/...
dir: https://github.com/pantsbuild/pants/pull/14016
And that didn't work at all. I won't be hacking here for a while so I've released the issue back to the pool :/
w
thanks!
is the description up to date?
e
Yes. The macOS perms weirdness just made it look like the fix didn't work, since it resulted in a perm error with a different reason - although, crucially, every single time, not ~30% of the time.
w
got it. thanks!