
enough-analyst-54434

12/27/2021, 7:21 PM
Re: https://github.com/pantsbuild/pants/issues/13899 Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.
Instead of adding my second +1 account to the issue thread, I'd rather do something useful here.
@fast-nail-55400 & @witty-crayon-22786 I think you two have the context here.

witty-crayon-22786

12/27/2021, 7:52 PM
no objections: i just haven’t triaged it yet to be able to confirm that that would fix it
lemme see

enough-analyst-54434

12/27/2021, 7:53 PM
It's a pretty tiny change until you get to all of the Processes that need to include a new bit in their normal input digest. That ripple is still simple, but large. So if you say go, I'll do it. There is no way it can't fix it, since the paths no longer overlap and normal digests are used, which work everywhere else.
ETXTBSY comes back as a possibility though.
I think the frequency of that breakage was much lower though, which both CI Pants dev consumers and customers would appreciate.

witty-crayon-22786

12/27/2021, 7:59 PM
> Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.
it's not clear to me that this would resolve it, since they're currently part of the same digest, which means they should be materialized together in one call to `materialize_directory`

enough-analyst-54434

12/27/2021, 7:59 PM
My proposal splits them into a separate digest at a separate path.

witty-crayon-22786

12/27/2021, 8:00 PM
the fact that this is always in `setup_jdk` seems relevant… it's something that that `@rule` invokes that is causing the trouble.

enough-analyst-54434

12/27/2021, 8:00 PM
Ok, fine, I'll dig deeper. My fix works though.
I don't think you actually read my proposal well.

witty-crayon-22786

12/27/2021, 8:01 PM
i think i don’t understand your hypothesis then… when you’re materializing a single digest, you create and populate each directory exactly once recursively
so three files in one directory as part of a single digest should be fine

enough-analyst-54434

12/27/2021, 8:01 PM
I directly state I split them

witty-crayon-22786

12/27/2021, 8:02 PM
i know: i don’t understand how that would prevent the bug

enough-analyst-54434

12/27/2021, 8:02 PM
There are now 2 digests, an immutable one for coursier and a normal one for the 2 scripts
the normal one gets merged into a normal process input digest.
and it's at a different path
I used __coursier/bin for immutable and __coursier/scripts for the scripts.
IOW the normal digest has the scripts so they never get the immutable r/o treatment.

witty-crayon-22786

12/27/2021, 8:13 PM
afaict, what appears to be happening is that something else is being written as immutable at the relevant location, such that when we try to materialize Coursier's digest (containing the three files) at `__coursier/*`, it fails. it's not clear to me that moving those two files out won't expose this same thing happening to the coursier binary (since as mentioned above: they're all part of one digest, and one call to `materialize_directory`)

enough-analyst-54434

12/27/2021, 8:14 PM
So have I read things right in that for 1 scheduler we have 1 `ImmutableInputs` which has 1 fixed tmp dir?

witty-crayon-22786

12/27/2021, 8:15 PM
i can’t figure out from inspection why those files might be materialized more than once though, so … i don’t have any better theories. if it’s not a huge change, then we can try it, but it sounds non-trivial

enough-analyst-54434

12/27/2021, 8:15 PM
If its actions aren't memoized, which it doesn't seem they are, you write the same paths twice to the tmp dir

witty-crayon-22786

12/27/2021, 8:15 PM
> If its actions aren't memoized, which it doesn't seem they are, you write the same paths twice to the tmp dir

they are memoized by the `ImmutableInputs` struct
…but maybe i screwed up the locking in there… that would do it.

enough-analyst-54434

12/27/2021, 8:16 PM
But the lock is on digest not paths
Couldn't two digests intersect paths?
That would go boom

witty-crayon-22786

12/27/2021, 8:17 PM
no, they shouldn’t. because they share zero content: the digest is the root directory, and never overlaps other directories
no deduping of internal digests

enough-analyst-54434

12/27/2021, 8:18 PM
If I have a digest that maps to foo/bar + foo/baz and another to just foo/bar, aren't those two separate locks?
But overlapping tmp dir paths?

witty-crayon-22786

12/27/2021, 8:18 PM
no: `$tmp/${digest1}/foo/*` and `$tmp/${digest2}/foo/*`

enough-analyst-54434

12/27/2021, 8:19 PM
Ah, missed the digest str append

witty-crayon-22786

12/27/2021, 8:19 PM
er: two separate locks, and two separate file paths

enough-analyst-54434

12/27/2021, 8:19 PM
Ok
Well, at any rate, the main point is this is bleeding in CI. We used to be bad about this. We got a whole lot better. It would be great not to slip again. I know there are non-OSS pressures though.

witty-crayon-22786

12/27/2021, 8:21 PM
yea, i’m fine with trying it, since i don’t have any better hypotheses right now. i just can’t say that i understand why it would work

enough-analyst-54434

12/27/2021, 8:22 PM
I think it's unquestionable it will work; to your point, the problem will kick down to the coursier bin.

witty-crayon-22786

12/27/2021, 8:22 PM
one way that would fail would be if `ImmutableInputs` fails to materialize a directory: we don't currently clean up after a failed attempt, and so the next attempt to materialize will encounter a readonly layout. but i don't see a failed attempt to materialize here

enough-analyst-54434

12/27/2021, 8:23 PM
Yeah - that's definitely a failure mode to fix whether or not it's what is happening here.
This looks buggy to me: https://github.com/chrislearn/double-checked-cell-async/blob/master/src/lib.rs#L232 I think that needs to be Ordering::Acquire. It appears that once_init (which the sync version of this crate is deprecated in favor of) does use Acquire for its second check. I'm not super positive on my analysis yet, though.
👀 1

witty-crayon-22786

12/27/2021, 9:40 PM
mm. the age of that crate always seemed like a liability. but yea, that would certainly do it.

enough-analyst-54434

12/27/2021, 9:41 PM
The thing is I thought all this went out the window on Intel - I thought it had stricter semantics for atomics regardless. Very, very handwavy on my part at this point.
That said - that crate, though new / active, begs off production use.
How can double-checked locking bugs still be a thing!

witty-crayon-22786

12/27/2021, 9:55 PM
sigh.

enough-analyst-54434

12/27/2021, 9:56 PM
👍 1

witty-crayon-22786

12/27/2021, 9:56 PM
i’ve only ever used orderings as conservatively as possible, because i don’t understand them.
the `tokio` folks have a library/framework to check all interleavings though: https://tokio.rs/blog/2019-10-scheduler#fearless-unsafe-concurrency-with-loom

enough-analyst-54434

12/27/2021, 9:57 PM
Yeah, I'm pretty shaky there too. Maybe it's not too hard to add a flag to switch between those two cell impls, and I can flag the new one on in CI for a while? Or maybe there is just a simpler / slightly less efficient way to structure our code.

witty-crayon-22786

12/27/2021, 10:06 PM
another option would be to add an assert that the destination doesn’t already exist, to validate the hypothesis? incompatible with “doing cleanup if we fail to materialize”, but. given the “failed is valid, but needs to cleanup” case, it seems like maybe rather than a cell, this should* be an async Mutex with a state machine inside

enough-analyst-54434

12/27/2021, 10:08 PM
Yeah, I need to break here for a few hours but I'll return to this more holistically this evening. I'd like to squash this dead and move on.

witty-crayon-22786

12/27/2021, 10:09 PM
ok, thanks a lot. i’m tied up this evening, but will take a look if i can later, or tomorrow if not

enough-analyst-54434

12/28/2021, 3:34 AM
Loom is great. Unfortunately it explicitly does not / cannot handle Ordering::Relaxed. I rigged up a quick loom test of the dcl code in parking_lot that fails correctly if the double-check writes with Ordering::Acquire, for example, but - yeah, unfortunately not a tool that can be used for this case.
I'll try just using normal code and hammering it to see if I can trigger a broken dcl event.
Actually - although I am curious to see if the dcl is broken - is a Pex / atomic_directory style lock and population of a permanent ~/.cache/pants/immutable_caches dir desirable? Right now we re-do unpacking from LMDB on every pantsd restart. The good side is auto-gc of course and no waiting on the cache goal ticket. The bad side is redoing this every time and ... potentially dcl bugs.
Tom's initial issue was effectively not having an atomic_directory construct / API in the rules.

witty-crayon-22786

12/28/2021, 5:31 PM
i’ll get out a sanity check to confirm the hypothesis shortly, unless you already have something pending

enough-analyst-54434

12/28/2021, 5:47 PM
Let's assume that confirms, do you have an opinion on fix dcl vs switch to atomic directory (which kills 2 birds re population failure cleanup)? Further do you have an opinion on the location of an atomic dir if you're ok with that idea - ephemeral like today vs in ~/.cache/pants?

witty-crayon-22786

12/28/2021, 5:50 PM
I would rather not add something that needs cleanup without also adding the cleanup. The other reason we left them ephemeral was to avoid needing to worry that the digest had been mutated and gone out of sync run-over-run.
But would we really want to rely on atomic renames to prevent N threads in one process from materializing atop one another? You'd still want (file) locking, I presume

enough-analyst-54434

12/28/2021, 5:53 PM
Yes, you'd want file locking. That's how it's done over in Pex for most cases (bool flag).
I buy the cleanup bit. The digest mutated bit sounds unusefully paranoid. I'm probably not seeing some case you are? I'd think if our digesting was broken then way more would be broken too.

witty-crayon-22786

12/28/2021, 5:56 PM
Less of a concern since we've bothered to make the directory structure readonly I suppose. Code would have to override that (although some definitely does)

enough-analyst-54434

12/28/2021, 5:58 PM
Ok, so ... this is getting painful. Go or no go. I'm stepping in to fix bleeding but if you have opinions how it should be done I can step back.

witty-crayon-22786

12/28/2021, 6:00 PM
Do whatever you want. I always "have an opinion", but you know what you're about.

enough-analyst-54434

12/28/2021, 9:23 PM
I stuck with the userspace in-mem locking since the interprocess locking only makes sense if/when we go to storing this stuff in a stable `~/.cache/pants/...` dir: https://github.com/pantsbuild/pants/pull/14016
And that didn't work at all. I won't be hacking here for a while so I've released the issue back to the pool :/

witty-crayon-22786

01/07/2022, 11:55 PM
thanks!
is the description up to date?

enough-analyst-54434

01/08/2022, 12:26 AM
Yes. The macOS perms weirdness just made it look like the fix didn't work, since it resulted in a permission error with a different cause - although, crucially, every single time, not ~30%.

witty-crayon-22786

01/08/2022, 12:26 AM
got it. thanks!