Can someone help me understand the exact semantics of a proc Pants #development


bitter-ability-32190

07/22/2022, 3:48 PM

Can someone help me understand the exact semantics of a process's append-only-caches?

👀 1

bitter-ability-32190

07/22/2022, 3:48 PM

Is the process allowed to edit files that exist there?

enough-analyst-54434

07/22/2022, 3:56 PM

No. You can add files only (thus append).

enough-analyst-54434

07/22/2022, 3:56 PM

And adding must be atomic - Pex has a good deal of infra to ensure that bit amongst racing processes.
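
As a rough illustration of an atomic add, here is a minimal sketch. It assumes a POSIX-like filesystem where a rename within one directory is atomic. The name `atomic_add` and the flat directory layout are hypothetical, and this is not a description of how Pex implements it.

```python
import os
import tempfile


def atomic_add(cache_dir: str, name: str, data: bytes) -> str:
    """Add `name` to `cache_dir` so that readers never see a partial file."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, name)
    if os.path.exists(final_path):
        # Append-only: an existing entry is never edited.
        return final_path
    # Write to a private temp file in the same directory (same filesystem),
    # so that the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

If two processes race to add the same name, the last rename wins, so this is only safe when a given name always maps to the same content.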

bitter-ability-32190

07/22/2022, 3:57 PM

Do we have an equivalent way to expose a "cache" dir for writing cache files?

enough-analyst-54434

07/22/2022, 3:57 PM

Basically, you can only do things that are safe to be viewed at any point in time by any other parallel process.

enough-analyst-54434

07/22/2022, 3:59 PM

> Do we have an equivalent way to expose a "cache" dir for writing cache files?

I'm not following. An equivalent to Pex's infra?

enough-analyst-54434

07/22/2022, 3:59 PM

Perhaps you want ImmutableCaches - just a sec for links...

bitter-ability-32190

07/22/2022, 4:00 PM

I want a cache dir I can write files to. This is pants-process specific, not PEX. And the tool I'm running expects to update those cache files if invalid

enough-analyst-54434

07/22/2022, 4:01 PM

`immutable_input_digests` on `Process` is the only other primitive I'm aware of: https://github.com/pantsbuild/pants/blob/710ff093a390718084ecdf3f0f0d159f3e107963/src/python/pants/backend/codegen/protobuf/go/rules.py#L247

enough-analyst-54434

07/22/2022, 4:02 PM

Ah, yeah - there's zero support for mutation like that at the moment.

enough-analyst-54434

07/22/2022, 4:04 PM

I think you see the difficulty. You'll have to find a way to make the mutation multi-process-safe. The only way I'm aware of is if all readers and the writer use the same POSIX file lock or similar to guard the mutation.
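
A minimal sketch of the shared-lock idea, assuming a POSIX system and that every reader and writer cooperates by taking the same advisory lock. The helper names and the `.lock` file are hypothetical; a tool you don't control would not take this lock on its own.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def cache_lock(cache_dir: str, exclusive: bool):
    """Hold an advisory flock on a lock file inside the cache directory."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, ".lock")
    with open(lock_path, "a") as lock_file:
        # Writers take an exclusive lock, readers a shared one.
        fcntl.flock(lock_file, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def update_entry(cache_dir: str, name: str, data: bytes) -> None:
    with cache_lock(cache_dir, exclusive=True):
        with open(os.path.join(cache_dir, name), "wb") as f:
            f.write(data)


def read_entry(cache_dir: str, name: str) -> bytes:
    with cache_lock(cache_dir, exclusive=False):
        with open(os.path.join(cache_dir, name), "rb") as f:
            return f.read()
```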

enough-analyst-54434

07/22/2022, 4:04 PM

Or else copy aside the cache privately, mutate it, then copy back atomically.
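
One way to sketch "copy aside, mutate, copy back atomically" is a symlink swap. This assumes POSIX rename semantics and that the cache is always reached through a symlink (`cache_link`, a hypothetical name), never through a real directory at that path.

```python
import os
import shutil
import tempfile
from typing import Callable


def mutate_via_copy(cache_link: str, mutate: Callable[[str], None]) -> None:
    """Copy the cache aside, mutate the private copy, then publish it atomically."""
    parent = os.path.dirname(os.path.abspath(cache_link))
    private = tempfile.mkdtemp(dir=parent, prefix="cache-")
    if os.path.exists(cache_link):
        # Start from the currently published contents.
        shutil.copytree(os.path.realpath(cache_link), private, dirs_exist_ok=True)
    mutate(private)  # no other process can see this copy yet
    tmp_link = private + ".link"
    os.symlink(private, tmp_link)
    # Renaming one symlink over another is a single atomic step on POSIX.
    os.replace(tmp_link, cache_link)
```

Old copies are never cleaned up here, and two concurrent writers would each publish their own copy, with the later swap discarding the earlier writer's changes.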

bitter-ability-32190

07/22/2022, 4:10 PM

Lots to think about I suppose

enough-analyst-54434

07/22/2022, 4:12 PM

Not being able to rely on an OS / FS does make this harder than need be. Snapshotting filesystems would be nice to leverage.

bitter-ability-32190

07/22/2022, 4:13 PM

Not to mention I don't control the tool I'm running's internals 🙈

bitter-ability-32190

07/22/2022, 4:16 PM

Tacking on as well, is there an easy way to setup/teardown a process without hand-crafting an argv?

enough-analyst-54434

07/22/2022, 4:16 PM

I'm not sure what you mean.

bitter-ability-32190

07/22/2022, 4:16 PM

I suppose I can materialize shell scripts, and call the "entry" script.

bitter-ability-32190

07/22/2022, 4:17 PM

Like a setup script (to set up the cache dir), and a teardown script (to atomically write to the append cache)

bitter-ability-32190

07/22/2022, 4:17 PM

But persist the original exit code, and ignore output from setup/teardown, etc...

enough-analyst-54434

07/22/2022, 4:18 PM

Yeah, that sounds impossible in a single argv unless it's to `bash -c ....` or a custom script.
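
A custom script along those lines could look like the sketch below: run setup, run the real tool, run teardown, and report only the tool's exit code. The function name and argument shapes are hypothetical.

```python
import subprocess
from typing import Sequence


def run_with_hooks(
    setup: Sequence[str], main: Sequence[str], teardown: Sequence[str]
) -> int:
    """Run setup, main, and teardown; return main's exit code only."""
    quiet = {"stdout": subprocess.DEVNULL, "stderr": subprocess.DEVNULL}
    # Setup output and exit code are ignored.
    subprocess.run(setup, **quiet)
    # The real tool inherits stdout/stderr so its output is preserved.
    result = subprocess.run(main)
    # Teardown runs even if the tool failed; its result is ignored too.
    subprocess.run(teardown, **quiet)
    return result.returncode
```

A wrapper like this would be materialized into the sandbox and invoked as the single argv, passing the original tool's arguments through as `main`.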

happy-kitchen-89482

07/22/2022, 4:59 PM

There has been some discussion of mutable caches, to support the mypy cache

bitter-ability-32190

07/22/2022, 4:59 PM

> to support the mypy cache

That's actually what I'm looking into 🙂

happy-kitchen-89482

07/22/2022, 4:59 PM

So there is at least one prominent use-case for it apart from yours

happy-kitchen-89482

07/22/2022, 4:59 PM

Oh, hah

bitter-ability-32190

07/22/2022, 4:59 PM

Writing my findings on the ticket https://github.com/pantsbuild/pants/issues/10864

happy-kitchen-89482

07/22/2022, 5:00 PM

OK, so you have the context of those discussions

happy-kitchen-89482

07/22/2022, 5:00 PM

🙂

bitter-ability-32190

07/22/2022, 5:07 PM

Yeah I think some kind of FileSystem snapshotting would really be ideal for this kind of use case. Moving the onus of work from each tool and into Pants

bitter-ability-32190

07/22/2022, 5:17 PM

Semi-related. Given a digest, how "expensive" is it to ask for every file's fingerprint?

average-vr-56795

07/22/2022, 5:23 PM

> Semi-related. Given a digest, how "expensive" is it to ask for every file's fingerprint?

Pretty much free

bitter-ability-32190

07/22/2022, 5:24 PM

I assumed as much. Although I think the path I'm going down is a dead-end. We'd also need to know every file in our requirements as well and this is starting to become quite the tangled mess

bitter-ability-32190

07/22/2022, 5:24 PM

It might be more feasible to add a `--pants` flag to mypy, akin to `--bazel`

bitter-ability-32190

07/22/2022, 5:31 PM

(going back on my earlier statement, looks like the writing might be atomic: https://github.com/python/mypy/blob/b0e59b29be239ce35cffcef20639709259ee48df/mypy/metastore.py#L101)

witty-crayon-22786

07/22/2022, 5:42 PM

the high level qualification for `append_only_caches` is really: “the cache was designed to be 1. shared by multiple processes, 2. in multiple repos”

bitter-ability-32190

07/22/2022, 5:43 PM

Makes sense. I'd like to believe mypy is multi-process safe in its cache

witty-crayon-22786

07/22/2022, 5:43 PM

yea, 2. is my primary concern with mypy.

bitter-ability-32190

07/22/2022, 5:44 PM

Then just the fact that `--bazel` hardcodes CWD for the cache dir is basically the only issue

bitter-ability-32190

07/22/2022, 5:44 PM

Same concern with 2, but my idea was to make a different subdir in the cache per repo (somehow)

witty-crayon-22786

07/22/2022, 5:45 PM

@bitter-ability-32190: my understanding of the `--bazel` flag (and i haven’t looked at it recently) is that no caches would be necessary, because you could treat mypy like a compiler: files and compilation-results-from-dependencies in, compilation results out, etc

witty-crayon-22786

07/22/2022, 5:46 PM

i don’t know of a better name for it, but basically: separate compilation

bitter-ability-32190

07/22/2022, 5:46 PM

Those compilation results need to exist somewhere though and be re-used for incrementality tho right? I think I'm missing something

witty-crayon-22786

07/22/2022, 5:47 PM

if i’m understanding correctly, they’re captured as the outputs of having compiled/checked a file, and then fed into the dependees’ check
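
A toy model of that flow, assuming each check only needs the outputs of its direct dependencies (whether mypy can actually work that way is not established here). `check_separately` and `check_one` are hypothetical names, and outputs are plain strings standing in for cache entries.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Mapping


def check_separately(
    deps: Mapping[str, List[str]],
    check_one: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Check each file once, feeding it only its dependencies' outputs."""
    outputs: Dict[str, str] = {}
    # static_order() yields dependencies before the files that depend on them.
    for path in TopologicalSorter(deps).static_order():
        dep_outputs = {d: outputs[d] for d in deps.get(path, [])}
        outputs[path] = check_one(path, dep_outputs)
    return outputs
```

Because each call to `check_one` depends only on its explicit inputs, every step could be cached or run in parallel with unrelated steps.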

bitter-ability-32190

07/22/2022, 5:48 PM

Ohhhhhhh I see. And I think I understand. And I understand why you want to treat mypy like a compiler and check iteratively (recursively but same diff)

witty-crayon-22786

07/22/2022, 5:50 PM

yea. it can actually be remotely executed* (can always be remotely cached) without support for append-only-caches on a cluster.

bitter-ability-32190

07/22/2022, 5:50 PM

So yes, and no. `--bazel` basically outputs the same cache as always, just tweaked for sandbox's sake. I think one important thing here is that `mypy` will output the "artifact" for our file and every file it depends on, including stdlib. So might get very busy for a single "compile". This is hurting my brain though, we're reaching the limit of used brain cells

witty-crayon-22786

07/22/2022, 5:52 PM

the digest merging of inputs would be … tricky. but yea, if you merge the caches from multiple inputs, and ignore collisions (?) then you’d end up with all dependencies having cache entries

bitter-ability-32190

07/22/2022, 5:52 PM

Yeah. Would there even be collisions?

bitter-ability-32190

07/22/2022, 5:52 PM

There'd be duplicates for sure. But collisions?

witty-crayon-22786

07/22/2022, 5:52 PM

(currently digest merging fails for non-equal duplicate filenames: but in this case, the stdlib will end up being checked multiple times potentially)

witty-crayon-22786

07/22/2022, 5:53 PM

(…possibly all thirdparty code, actually)

bitter-ability-32190

07/22/2022, 5:54 PM

Yeah if you're "compiling" and none of your dependencies have "compiled" a 3rdparty or stdlib you'll do the work. Even though the work might've been done before by another file in a similar fashion

witty-crayon-22786

07/22/2022, 5:54 PM

right. and that would result in (harmless) collisions in the cache

witty-crayon-22786

07/22/2022, 5:54 PM

(harmless: famous last words)

witty-crayon-22786

07/22/2022, 5:55 PM

but yea, an option to allow digest merging to optionally succeed in that case would be easy to implement
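
To make the merging behaviour concrete, here is a toy model in which a digest is just a mapping of path to fingerprint. It is not Pants's actual API; `merge_digests` and `allow_conflicts` are hypothetical names for the optional behaviour being discussed.

```python
from typing import Dict, Iterable, Mapping


class MergeConflict(Exception):
    """Raised when the same path appears with different fingerprints."""


def merge_digests(
    digests: Iterable[Mapping[str, str]], allow_conflicts: bool = False
) -> Dict[str, str]:
    """Merge path -> fingerprint mappings.

    Identical duplicates always merge. Duplicates with different fingerprints
    raise unless `allow_conflicts` is set, in which case the first one wins.
    """
    merged: Dict[str, str] = {}
    for digest in digests:
        for path, fingerprint in digest.items():
            existing = merged.setdefault(path, fingerprint)
            if existing != fingerprint and not allow_conflicts:
                raise MergeConflict(f"{path}: {existing} != {fingerprint}")
    return merged
```

"First one wins" is an arbitrary choice for this sketch; any deterministic rule would do as long as the colliding entries really are interchangeable.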

bitter-ability-32190

07/22/2022, 5:55 PM

I think you'll hit cache duplicates all over the place, lol. I think the real "pain" here is duplicate work

witty-crayon-22786

07/22/2022, 5:56 PM

my guess is that the (process) cache hitrate and parallelism more than make up for it.

witty-crayon-22786

07/22/2022, 5:57 PM

but since mypy wasn’t designed for separate compilation from the start, who knows.

bitter-ability-32190

07/22/2022, 5:57 PM

Yeah but that's not cranking it to 11

bitter-ability-32190

07/22/2022, 5:58 PM

Alright I think Stu has all the pieces, this has been fun 😛

witty-crayon-22786

07/22/2022, 5:59 PM

https://github.com/pantsbuild/pants/issues/10864#issuecomment-999863710 talks about the parallels with `jvm` compilation: would be happy to help pair on it if you’re interested in tackling!

witty-crayon-22786

07/22/2022, 6:00 PM

…as mentioned there, https://github.com/pantsbuild/pants/issues/14070 might be good to do first

bitter-ability-32190

07/22/2022, 6:01 PM

I'm a little wondering if we could combine this with a dedicated named cache dir. That could help "bootstrap" the cache for the 3rdparty/stdlib 🤔

witty-crayon-22786

07/22/2022, 6:03 PM

not sure how to make mypy safe for concurrent use from multiple repos. i.e. 2 from: https://pantsbuild.slack.com/archives/C0D7TNJHL/p1658511779978859?thread_ts=1658504921.916949&cid=C0D7TNJHL

witty-crayon-22786

07/22/2022, 6:03 PM

the cache key is the filename. if you have the same filename with different contents in two copies of the repository, then you’d overwrite it with the wrong content.

witty-crayon-22786

07/22/2022, 6:04 PM

…could prefix the location in the named cache with the absolute path of the repo maybe
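
A minimal sketch of the per-repo prefix idea. It hashes the repo's absolute path rather than embedding the path directly, which is an assumption made here to keep the key a single filesystem-safe path component. `repo_cache_dir` is a hypothetical helper.

```python
import hashlib
import os


def repo_cache_dir(named_cache_root: str, repo_root: str) -> str:
    """Return a per-repo subdirectory of the named cache."""
    absolute = os.path.abspath(repo_root)
    # Two checkouts at different absolute paths get different subdirectories,
    # so the same filename with different contents cannot clobber the other.
    key = hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]
    return os.path.join(named_cache_root, key)
```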

bitter-ability-32190

07/22/2022, 6:09 PM

Yeah that's what I suggested. Different subdir per repo

bitter-ability-32190

07/22/2022, 6:11 PM

What's interesting too is that mypy has an "interface hash" for each file. Almost makes you wonder if we could leverage that for file invalidation w.r.t. dependencies 🤔

witty-crayon-22786

07/22/2022, 6:32 PM

> Almost makes you wonder if we could leverage that for file invalidation w.r.t. dependencies 🤔

yea, totally. to be clear though, that should be the effect of feeding a check only the cache and not files themselves for dependencies

witty-crayon-22786

07/22/2022, 6:33 PM

i.e. the file content might have changed, but because its cache entry / interface did not, you can hit for dependees

✅ 1
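
The effect described above can be sketched as a cache key built from a file's own content plus its dependencies' interface hashes, not their contents. The function below is a hypothetical illustration; how the interface hashes are obtained from mypy is left out.

```python
import hashlib
from typing import Mapping


def check_cache_key(source: bytes, dep_interface_hashes: Mapping[str, str]) -> str:
    """Key for one file's check: its own content plus its deps' interface hashes."""
    h = hashlib.sha256()
    h.update(source)
    # Sort so the key does not depend on mapping order.
    for dep in sorted(dep_interface_hashes):
        h.update(b"\0" + dep.encode() + b"\0" + dep_interface_hashes[dep].encode())
    return h.hexdigest()
```

If a dependency's body changes but its interface hash stays the same, the dependee's key is unchanged and its earlier result can be reused.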

witty-crayon-22786

07/22/2022, 6:33 PM

> …feeding a check only the cache and not files themselves for dependencies

assuming mypy allows that. 🤞

witty-crayon-22786

07/22/2022, 6:34 PM

`--bazel` mode doesn’t, it should.
