Re <https github com pantsbuild pants issues 13899> Is there Pants #development

Re: <https://github.com/pantsbuild/pants/issues/13...

enough-analyst-54434

12/27/2021, 7:21 PM

Re: https://github.com/pantsbuild/pants/issues/13899 Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.

enough-analyst-54434

12/27/2021, 7:22 PM

Instead of adding my second +1 account to the issue thread, I'd rather do something useful here.

enough-analyst-54434

12/27/2021, 7:23 PM

@fast-nail-55400 & @witty-crayon-22786 I think you two have the context here.

witty-crayon-22786

12/27/2021, 7:52 PM

no objections: i just haven’t triaged it yet to be able to confirm that that would fix it

witty-crayon-22786

12/27/2021, 7:53 PM

lemme see

enough-analyst-54434

12/27/2021, 7:53 PM

Its a pretty tiny change until you get to all of the Processes that need to include a new bit in their normal input digest. That ripple is still simple, but large. So if you say go, I'll do it. There is no way it can't fix it since the paths no longer overlap and normal digests are use which work everywhere else.

enough-analyst-54434

12/27/2021, 7:54 PM

ETXTBSY comes back as a possibility though.

enough-analyst-54434

12/27/2021, 7:54 PM

I think that frequency of break though was much lower which both CI Pants dev consumers would appreciate as well as customers.

witty-crayon-22786

12/27/2021, 7:59 PM

Is there any objection to moving the 2 coursier scripts out of bin/ and into a scripts/ dir to make them independent of the immutable input digest for the coursier binary itself? That would stop the bleeding in CI and buy time for someone to solve the overlap issue.

it’s not clear to me that this would resolve it, since they’re currently part of the same digest, which means they should be materialized together in one call to

materialize_directoy

enough-analyst-54434

12/27/2021, 7:59 PM

My proposal splits them into a seperate digest at a seperate path.

witty-crayon-22786

12/27/2021, 8:00 PM

the fact that this is always in

setup_jdk

seems relevant… it’s something that that

@rule

invokes that is causing the trouble.

enough-analyst-54434

12/27/2021, 8:00 PM

Ok, fine I'll dig deeper. My fix works though.

enough-analyst-54434

12/27/2021, 8:00 PM

I don;t think you actually read my proposal well.

witty-crayon-22786

12/27/2021, 8:01 PM

i think i don’t understand your hypothesis then… when you’re materializing a single digest, you create and populate each directory exactly once recursively

witty-crayon-22786

12/27/2021, 8:01 PM

so three files in one directory as part of a single digest should be fine

enough-analyst-54434

12/27/2021, 8:01 PM

I directly state I split them

witty-crayon-22786

12/27/2021, 8:02 PM

i know: i don’t understand how that would prevent the bug

enough-analyst-54434

12/27/2021, 8:02 PM

There are now 2 digests, an immutable one for coursier and a normal one for the 2 scripts

enough-analyst-54434

12/27/2021, 8:02 PM

the normal one gets merged into a normal process input digest.

enough-analyst-54434

12/27/2021, 8:02 PM

and its at a different path

enough-analyst-54434

12/27/2021, 8:02 PM

I used __coursier/bin for immutable and __coursier/scripts for the scripts.

enough-analyst-54434

12/27/2021, 8:05 PM

IOW the normal digest has the scripts so they never get the immutable r/o treatment.

witty-crayon-22786

12/27/2021, 8:13 PM

afaict, what appears to be happening is that something else is being written as immutable at the relevant location, such that when we try to materialize Coursier’s digest (containing the three files) at

__coursier/*

, it fails. it’s not clear to me that moving those two files out won’t expose this same thing happening to the coursier binary (since as mentioned above: they’re all part of one digest, and one call to

materialize_directory

)

enough-analyst-54434

12/27/2021, 8:14 PM

So have I read thinks right in that for 1 scheduler we have 1 ImmutableInputs which has 1 fixed tmp dir?

witty-crayon-22786

12/27/2021, 8:15 PM

i can’t figure out from inspection why those files might be materialized more than once though, so … i don’t have any better theories. if it’s not a huge change, then we can try it, but it sounds non-trivial

enough-analyst-54434

12/27/2021, 8:15 PM

If its actions aren;'t memoized, which it doesn't seem they are, you write same paths twice to tmp dir

witty-crayon-22786

12/27/2021, 8:15 PM

If its actions aren;’t memoized, which it doesn’t seem they are, you write same paths twice to tmp dir

they are memoized by the

ImmutableInputs

struct

witty-crayon-22786

12/27/2021, 8:16 PM

…but maybe i screwed up the locking in there… that would do it.

enough-analyst-54434

12/27/2021, 8:16 PM

But the lock is on digest not paths

enough-analyst-54434

12/27/2021, 8:16 PM

Couldn't two digests intesect paths?

enough-analyst-54434

12/27/2021, 8:16 PM

That would go boom

witty-crayon-22786

12/27/2021, 8:17 PM

no, they shouldn’t. because they share zero content: the digest is the root directory, and never overlaps other directories

witty-crayon-22786

12/27/2021, 8:17 PM

no deduping of internal digests

enough-analyst-54434

12/27/2021, 8:18 PM

If I have a digest that maps to foo/bar + foo/baz and another to just foo/bar aren't those two seperate locks?

enough-analyst-54434

12/27/2021, 8:18 PM

But overlapping tmp dir paths?

witty-crayon-22786

12/27/2021, 8:18 PM

no:

$tmp/${digest1}/foo/*

and

$tmp/${digest2}/foo/*

enough-analyst-54434

12/27/2021, 8:19 PM

Ah, missed the digest str append

witty-crayon-22786

12/27/2021, 8:19 PM

er: two separate locks, and two separate file paths

enough-analyst-54434

12/27/2021, 8:19 PM

enough-analyst-54434

12/27/2021, 8:20 PM

Well, at any rate, the main point is this is bleeding in CI. We used to be bad about this. We got a whole lot better. It would be great not to slip again. I know there are non-OSS pressures though.

witty-crayon-22786

12/27/2021, 8:21 PM

yea, i’m fine with trying it, since i don’t have any better hypotheses right now. i just can’t say that i understand why it would work

enough-analyst-54434

12/27/2021, 8:22 PM

I think its unquestinable it will work, to your point, the problem will kick down to coursier bin.

witty-crayon-22786

12/27/2021, 8:22 PM

one way that would fail would be if

ImmutableInputs

fails to materialize a directory: we don’t currently clean up after a failed attempt, and so the next attempt to materialize will encounter a readonly layout. but i don’t see a failed attempt to materialize here

enough-analyst-54434

12/27/2021, 8:23 PM

Yeah - that's definitely a failure mode to fix whether or not it's what is happening here.

enough-analyst-54434

12/27/2021, 9:38 PM

This looks buggy to me: https://github.com/chrislearn/double-checked-cell-async/blob/master/src/lib.rs#L232 I think that needs to be Ordering::Acquire. It appears the once_init the sync version of this is deprecated in favor of does use Acquire for its second check. I'm not super positive on my analysis though yet.

👀 1

witty-crayon-22786

12/27/2021, 9:40 PM

mm. the age of that crate always seemed like a liability. but yea, that would certainly do it.

enough-analyst-54434

12/27/2021, 9:41 PM

The thing is I thought all this went out the window on intel - I thought it had stricter semantics for atomics regardless. Very very handwavy on my part at this point.

enough-analyst-54434

12/27/2021, 9:53 PM

Ok async_oncecell appears to do the right things: 1. check: https://github.com/Yoeori/async-oncecell/blob/main/src/lib.rs#L93 2. double check: https://github.com/Yoeori/async-oncecell/blob/main/src/lib.rs#L111

enough-analyst-54434

12/27/2021, 9:54 PM

That said - that crate, though new / active begs off production use.

enough-analyst-54434

12/27/2021, 9:55 PM

How can double checked locking bugs still be a thing!

witty-crayon-22786

12/27/2021, 9:55 PM

sigh.

enough-analyst-54434

12/27/2021, 9:56 PM

My intro was this article in the way back: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

👍 1

witty-crayon-22786

12/27/2021, 9:56 PM

i’ve only ever used orderings as conservatively as possible, because i don’t understand them.

witty-crayon-22786

12/27/2021, 9:57 PM

the

tokio

folks have a library/framework to check all interleavings though: https://tokio.rs/blog/2019-10-scheduler#fearless-unsafe-concurrency-with-loom

enough-analyst-54434

12/27/2021, 9:57 PM

Yeah, I'm pretty shaky there too. Maybe its not too hard to flag in a switch between those two cell impls and I can flag the new one on in CI for a while? Or maybe there is just a simpler / slightly less efficient way to structure our code.

witty-crayon-22786

12/27/2021, 10:06 PM

another option would be to add an assert that the destination doesn’t already exist, to validate the hypothesis? incompatible with “doing cleanup if we fail to materialize”, but. given the “failed is valid, but needs to cleanup” case, it seems like maybe rather than a cell, this should* be an async Mutex with a state machine inside

enough-analyst-54434

12/27/2021, 10:08 PM

Yeah, I need to break here for a few hours but I'll return to this more wholistically this evening. I'd like to squash this dead and move on.

witty-crayon-22786

12/27/2021, 10:09 PM

ok, thanks a lot. i’m tied up this evening, but will take a look if i can later, or tomorrow if not

enough-analyst-54434

12/28/2021, 3:34 AM

Loom is great. Unfortunately it explicitly does not / cannot handle Ordering::Relaxed. I rigged u a quick loom test of the dcl code in parking_lot that fails correctly if the double-check writes with Ordering::Acquire for example, but - yeah, unfortunately not a tool that can be used for this case.

enough-analyst-54434

12/28/2021, 3:35 AM

I'll try jut using normal code and hammering it to see if I can trigger a broken dcl event.

enough-analyst-54434

12/28/2021, 4:10 AM

Actually - although I am curious to see if the dcl is broken - is a Pex / atomic_directory style lock and population of a permanent ~/.cache/pants/immutable_caches dir desirable? Right now we re-do unpacking from LMDB on every pantsd restart. The good side is auto-gc of course and no waiting on the cache goal ticket. The bad side is redoing this every time and ... potentially dcl bugs.

enough-analyst-54434

12/28/2021, 4:13 AM

Tom's initial issue was effectively not having an atomic_directory construct / API in the rules.

witty-crayon-22786

12/28/2021, 5:31 PM

i’ll get out a sanity check to confirm the hypothesis shortly, unless you already have something pending

witty-crayon-22786

12/28/2021, 5:36 PM

https://github.com/pantsbuild/pants/pull/14013

enough-analyst-54434

12/28/2021, 5:47 PM

Let's assume that confirms, do you have an opinion on fix dcl vs switch to atomic directory (which kills 2 birds re population failure cleanup)? Further do you have an opinion on the location of an atomic dir if you're ok with that idea - ephemeral like today vs in ~/.cache/pants?

witty-crayon-22786

12/28/2021, 5:50 PM

I would rather not add something that needs cleanup without also adding the cleanup. The other reason we left them ephemeral was to avoid needing to worry that the digest had been mutated and gone out of sync run-over-run.

witty-crayon-22786

12/28/2021, 5:52 PM

But would we really want to rely on atomic renames to prevent N threads in one process from materializing stop one another? You'd still want (file) locking I presume

enough-analyst-54434

12/28/2021, 5:53 PM

Yes, you'd want file locking. That's how it's done over in Pex for most cases (bool flag).

enough-analyst-54434

12/28/2021, 5:55 PM

I buy the cleanup bit. The digest mutated bit sounds unusefully paranoid. I'm probably not seeing some case you are? I'd think if our digesting was broken then way more would be broken too.

witty-crayon-22786

12/28/2021, 5:56 PM

Less of a concern since we've bothered to make the directory structure readonly I suppose. Code would have to override that (although some definitely does)

enough-analyst-54434

12/28/2021, 5:58 PM

Ok, so ... this is getting painful. Go or no go. I'm stepping in to fix bleeding but if you have opinions how it should be done I can step back.

witty-crayon-22786

12/28/2021, 6:00 PM

Do whatever you want. I always "have an opinion", but you know what you're about.

enough-analyst-54434

12/28/2021, 9:23 PM

I stuck with the userspace in-mem locking since the interprocess locking only makes sense if/when we go to storing this stuff in a stable

~/.cache/pants/...

dir: https://github.com/pantsbuild/pants/pull/14016

enough-analyst-54434

12/29/2021, 3:06 AM

And that didn't work at all. I won't be hacking here for a while so I've released the issue back to the pool :/

enough-analyst-54434

01/07/2022, 11:53 PM

https://github.com/pantsbuild/pants/pull/14016 should be good to go now.

witty-crayon-22786

01/07/2022, 11:55 PM

thanks!

witty-crayon-22786

01/07/2022, 11:55 PM

is the description up to date?

enough-analyst-54434

01/08/2022, 12:26 AM

Yes. The macOS perms weirdness just made it look like the fix didn't work since it resulted in a different-reasoned perm error - although crucially - every single time, no ~30%.

witty-crayon-22786

01/08/2022, 12:26 AM

got it. thanks!

Open in Slack

Previous Next