# development
b
Can someone help me understand the exact semantics of a process's append-only-caches?
Is the process allowed to edit files that exist there?
e
No. You can add files only (thus append).
And adding must be atomic - Pex has a good deal of infra to ensure that bit amongst racing processes.
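A minimal sketch of what "add only, and atomically" can look like for a single file. This is not Pex's actual infrastructure; `add_to_cache` and the `.tmp-` prefix are hypothetical names chosen for illustration. The content is written to a private temp file first, then hard-linked into place, so other processes see either no entry or the complete entry, and an existing entry is never overwritten. Readers are assumed to ignore `.tmp-` files.

```python
import os
import tempfile
from pathlib import Path


def add_to_cache(cache_dir: Path, name: str, content: bytes) -> bool:
    """Atomically add `name` to `cache_dir` without touching existing entries.

    Returns True if this call created the entry, False if it already existed.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the link below
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        try:
            # os.link raises FileExistsError instead of overwriting,
            # so a racing process that got there first simply wins.
            os.link(tmp_path, cache_dir / name)
            return True
        except FileExistsError:
            return False
    finally:
        # The temp name is no longer needed either way.
        os.unlink(tmp_path)
```

Losing the race is treated as success-with-no-work here, on the assumption that every writer of a given name would have produced equivalent content.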
b
Do we have an equivalent way to expose a "cache" dir for writing cache files?
e
Basically, you can only do things that are safe to be viewed at any point in time by any other parallel process.
I'm not following. An equivalent to Pex's infra?
Perhaps you want ImmutableCaches - just a sec for links...
b
I want a cache dir I can write files to. This is pants-process specific, not PEX. And the tool I'm running expects to update those cache files if invalid
e
Ah, yeah - there's zero support for mutation like that at the moment.
I think you see the difficulty. You'll have to find a way to make the mutation multi-process-safe. The only way I'm aware of is if all readers and the writer use the same POSIX file lock or similar to guard the mutation.
Or else copy aside the cache privately, mutate it, then copy back atomically.
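A rough sketch of the lock-guarded option, using `fcntl.flock` as the "or similar" advisory lock (Unix only). The lock file name `.lock` and the helper names are assumptions for illustration. It only works if every reader and writer cooperates, which is exactly the problem when the tool's internals are out of your control.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


@contextmanager
def cache_lock(cache_dir: Path, exclusive: bool) -> Iterator[None]:
    """Hold an advisory lock on the cache: shared for reads, exclusive for writes."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    lock_path = cache_dir / ".lock"  # hypothetical lock file name
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks until the lock can be acquired.
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


def update_entry(cache_dir: Path, name: str, content: bytes) -> None:
    """Mutate an entry in place while no reader can observe it."""
    with cache_lock(cache_dir, exclusive=True):
        (cache_dir / name).write_bytes(content)


def read_entry(cache_dir: Path, name: str) -> Optional[bytes]:
    """Read an entry while no writer can be mid-mutation."""
    with cache_lock(cache_dir, exclusive=False):
        path = cache_dir / name
        return path.read_bytes() if path.exists() else None
```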
b
Lots to think about I suppose
e
Not being able to rely on a particular OS / FS does make this harder than it needs to be. Snapshotting filesystems would be nice to leverage.
b
Not to mention I don't control the tool I'm running's internals 🙈
Tacking on as well, is there an easy way to set up/tear down a process without hand-crafting an argv?
e
I'm not sure what you mean.
b
I suppose I can materialize shell scripts, and call the "entry" script.
Like a setup script (to set up the cache dir), and a teardown script (to atomically write to the append cache)
But persist the original exit code, and ignore output from setup/teardown, etc...
e
Yeah, that sounds impossible in a single argv unless it's to bash -c .... or a custom script.
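One way the `bash -c` wrapper could be assembled, as a sketch only: `wrap_argv` is a hypothetical helper, not a Pants API. It silences the setup and teardown steps, runs the real command in between, and exits with the real command's status. Failures in setup or teardown are ignored here, which may or may not be what you want.

```python
import shlex
import subprocess
from typing import List, Sequence


def wrap_argv(
    setup: Sequence[str], main: Sequence[str], teardown: Sequence[str]
) -> List[str]:
    """Build one argv that runs setup, main, teardown and keeps main's exit code."""
    script = "\n".join(
        [
            # Setup output is discarded.
            f"{shlex.join(setup)} >/dev/null 2>&1",
            # The real tool keeps its own stdout/stderr.
            shlex.join(main),
            # Remember the tool's exit code before teardown clobbers $?.
            "status=$?",
            # Teardown output is discarded too.
            f"{shlex.join(teardown)} >/dev/null 2>&1",
            'exit "$status"',
        ]
    )
    return ["bash", "-c", script]
```

For example, `subprocess.run(wrap_argv(["true"], ["mytool", "--check"], ["true"]))` would report whatever `mytool` exited with (`mytool` being a placeholder).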
h
There has been some discussion of mutable caches, to support the mypy cache
b
That's actually what I'm looking into 🙂
h
So there is at least one prominent use-case for it apart from yours
Oh, hah
b
Writing my findings on the ticket https://github.com/pantsbuild/pants/issues/10864
h
OK, so you have the context of those discussions
🙂
b
Yeah I think some kind of filesystem snapshotting would really be ideal for this kind of use case. It moves the onus of the work from each tool into Pants
Semi-related. Given a digest, how "expensive" is it to ask for every file's fingerprint?
a
Pretty much free
b
I assumed as much. Although I think the path I'm going down is a dead-end. We'd also need to know every file in our requirements, and this is starting to become quite the tangled mess
It might be more feasible to add a --pants flag to mypy, akin to --bazel
(going back on my earlier statement, looks like the writing might be atomic: https://github.com/python/mypy/blob/b0e59b29be239ce35cffcef20639709259ee48df/mypy/metastore.py#L101)
w
the high level qualification for append_only_caches is really: “the cache was designed to be 1. shared by multiple processes, 2. in multiple repos”
b
Makes sense. I'd like to believe mypy is multi-process safe in its cache
w
yea, 2. is my primary concern with mypy.
b
Then just the fact that --bazel hardcodes CWD for the cache dir is basically the only issue
Same concern with 2, but my idea was to make a different subdir in the cache per repo (somehow)
w
@bitter-ability-32190: my understanding of the --bazel flag (and i haven’t looked at it recently) is that no caches would be necessary, because you could treat mypy like a compiler: files and compilation-results-from-dependencies in, compilation results out, etc
i don’t know of a better name for it, but basically: separate compilation
b
Those compilation results need to exist somewhere though, and be reused for incrementality, right? I think I'm missing something
w
if i’m understanding correctly, they’re captured as the outputs of having compiled/checked a file, and then fed into the dependees’ check
b
Ohhhhhhh I see. And I think I understand. And I understand why you want to treat mypy like a compiler and check iteratively (recursively but same diff)
w
yea. it can actually be remotely executed* (can always be remotely cached) without support for append-only-caches on a cluster.
b
So yes, and no.
--bazel basically outputs the same cache as always, just tweaked for sandbox's sake. I think one important thing here is that mypy will output the "artifact" for our file and every file it depends on, including stdlib. So might get very busy for a single "compile". This is hurting my brain though, we're reaching the limit of used brain cells
w
the digest merging of inputs would be … tricky. but yea, if you merge the caches from multiple inputs, and ignore collisions (?) then you’d end up with all dependencies having cache entries
b
Yeah. Would there even be collisions?
There'd be duplicates for sure. But collisions?
w
(currently digest merging fails for non-equal duplicate filenames: but in this case, the stdlib will end up being checked multiple times potentially)
(…possibly all thirdparty code, actually)
b
Yeah if you're "compiling" and none of your dependencies have "compiled" a 3rdparty or stdlib you'll do the work. Even though the work might've been done before by another file in a similar fashion
w
right. and that would result in (harmless) collisions in the cache
(harmless: famous last words)
but yea, an option to allow digest merging to optionally succeed in that case would be easy to implement
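To make the duplicate-versus-collision distinction concrete, here is a toy model of merging. It is not the Pants engine's digest-merging code: `Snapshot` is a plain dict standing in for a digest, and `merge_snapshots`, `MergeCollision`, and `allow_collisions` are hypothetical names. Equal duplicates always merge cleanly; non-equal duplicates fail unless the option is set, in which case the first entry seen is kept.

```python
from typing import Dict, Iterable

# Hypothetical stand-in for a digest: file path -> file content.
Snapshot = Dict[str, bytes]


class MergeCollision(Exception):
    """Raised when two inputs hold different content for the same path."""


def merge_snapshots(
    snapshots: Iterable[Snapshot], allow_collisions: bool = False
) -> Snapshot:
    merged: Snapshot = {}
    for snapshot in snapshots:
        for path, content in snapshot.items():
            existing = merged.get(path)
            if existing is None or existing == content:
                # New path, or an equal duplicate: nothing to resolve.
                merged[path] = content
            elif not allow_collisions:
                raise MergeCollision(path)
            # Otherwise: a tolerated collision, keep the first entry seen.
    return merged
```

Keeping the first entry is an arbitrary choice in this sketch; it is only safe if colliding entries really are interchangeable.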
b
I think you'll hit cache duplicates all over the place, lol. I think the real "pain" here is duplicate work
w
my guess is that the (process) cache hitrate and parallelism more than make up for it.
but since mypy wasn’t designed for separate compilation from the start, who knows.
b
Yeah but that's not cranking it to 11
Alright I think Stu has all the pieces, this has been fun 😛
w
https://github.com/pantsbuild/pants/issues/10864#issuecomment-999863710 talks about the parallels with jvm compilation: would be happy to help pair on it if you’re interested in tackling!
…as mentioned there, https://github.com/pantsbuild/pants/issues/14070 might be good to do first
b
I'm wondering a little if we could combine this with a dedicated named cache dir. That could help "bootstrap" the cache for the 3rdparty/stdlib 🤔
w
not sure how to make mypy safe for concurrent use from multiple repos, i.e. point 2 from: https://pantsbuild.slack.com/archives/C0D7TNJHL/p1658511779978859?thread_ts=1658504921.916949&cid=C0D7TNJHL
the cache key is the filename. if you have the same filename with different contents in two copies of the repository, then you’d overwrite it with the wrong content.
…could prefix the location in the named cache with the absolute path of the repo maybe
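A small sketch of the per-repo prefix idea. `repo_cache_dir` is a hypothetical helper, and hashing the absolute path (rather than embedding the path literally) is an assumption made here to keep the directory name short and free of separators. Two checkouts at different locations get disjoint subdirectories, so the same filename with different contents can't overwrite the other repo's entry.

```python
import hashlib
from pathlib import Path


def repo_cache_dir(named_cache_root: Path, repo_root: Path) -> Path:
    """Return a subdirectory of the named cache that is unique per repo checkout."""
    # Resolve to an absolute path so relative spellings of one repo agree.
    absolute = str(repo_root.resolve())
    # A short hex digest of the path serves as the per-repo key.
    repo_key = hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]
    return named_cache_root / repo_key
```

This keys on location only, so moving a checkout would start it with a cold cache.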
b
Yeah that's what I suggested. Different subdir per repo
What's interesting too is that mypy has an "interface hash" for each file. Almost makes you wonder if we could leverage that for file invalidation w.r.t. dependencies 🤔
w
yea, totally. to be clear though, that should be the effect of feeding a check only the cache and not files themselves for dependencies
i.e. the file content might have changed, but because its cache entry / interface did not, you can hit for dependees
assuming mypy allows feeding a check only the cache and not the files themselves for dependencies. 🤞
if --bazel mode doesn’t, it should.