#development
b

bitter-ability-32190

07/22/2022, 3:48 PM
Can someone help me understand the exact semantics of a process's append-only-caches?
Is the process allowed to edit files that exist there?
e

enough-analyst-54434

07/22/2022, 3:56 PM
No. You can add files only (thus append).
And adding must be atomic - Pex has a good deal of infra to ensure that bit amongst racing processes.
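A minimal sketch of what an atomic "add only" write could look like, assuming a POSIX-like filesystem. This is not Pex's actual implementation; `atomic_add` and the `.tmp-` prefix are hypothetical names used for illustration. The entry is written to a temporary file in the same directory and then hard-linked into place, so other processes see either no entry or a complete one, and an existing entry is never modified.

```python
import os
import tempfile
from pathlib import Path


def atomic_add(cache_dir: Path, name: str, data: bytes) -> bool:
    """Add `name` to the cache. Returns True if this call created the entry,
    False if another process had already added it (the entry is left untouched)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    # The temp file lives in the same directory so the final link stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        try:
            # os.link fails if the target exists, so a racing writer cannot clobber it.
            os.link(tmp, target)
            return True
        except FileExistsError:
            return False
    finally:
        os.unlink(tmp)
```

Readers would need to skip the short-lived `.tmp-*` files, which is one reason real tools tend to stage them in a separate directory.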
b

bitter-ability-32190

07/22/2022, 3:57 PM
Do we have an equivalent way to expose a "cache" dir for writing cache files?
e

enough-analyst-54434

07/22/2022, 3:57 PM
Basically, you can only do things that are safe to be viewed at any point in time by any other parallel process.
Do we have an equivalent way to expose a "cache" dir for writing cache files?
I'm not following. An equivalent to Pex's infra?
Perhaps you want ImmutableCaches - just a sec for links...
b

bitter-ability-32190

07/22/2022, 4:00 PM
I want a cache dir I can write files to. This is pants-process specific, not PEX. And the tool I'm running expects to update those cache files if invalid
e

enough-analyst-54434

07/22/2022, 4:01 PM
Ah, yeah - there's zero support for mutation like that at the moment.
I think you see the difficulty. You'll have to find a way to make the mutation multi-process-safe. The only way I'm aware of is if all readers and the writer use the same POSIX file lock or similar to guard the mutation.
Or else copy aside the cache privately, mutate it, then copy back atomically.
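The lock-guarded option can be sketched roughly as follows, assuming a POSIX system where `fcntl.flock` is available. The `cache_lock` helper and the `.lock` file name are hypothetical. The lock is advisory, so it only helps if every reader and writer of the cache goes through it.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def cache_lock(cache_dir: Path, exclusive: bool) -> Iterator[None]:
    """Hold a shared (reader) or exclusive (writer) advisory lock on the cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    fd = os.open(cache_dir / ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks until the lock is granted.
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A writer would wrap its mutation in `with cache_lock(cache_dir, exclusive=True):` and readers would use `exclusive=False`. When the tool's internals are out of your control, the whole tool invocation would have to run inside the lock.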
b

bitter-ability-32190

07/22/2022, 4:10 PM
Lots to think about I suppose
e

enough-analyst-54434

07/22/2022, 4:12 PM
Not being able to rely on an OS / FS does make this harder than need be. Snapshotting filesystems would be nice to leverage.
b

bitter-ability-32190

07/22/2022, 4:13 PM
Not to mention I don't control the tool I'm running's internals 🙈
Tacking on as well, is there an easy way to set up / tear down a process without hand-crafting an argv?
e

enough-analyst-54434

07/22/2022, 4:16 PM
I'm not sure what you mean.
b

bitter-ability-32190

07/22/2022, 4:16 PM
I suppose I can materialize shell scripts, and call the "entry" script.
Like a setup script (to set up the cache dir), and a teardown script (to atomically write to the append cache)
But persist the original exit code, and ignore output from setup/teardown, etc...
e

enough-analyst-54434

07/22/2022, 4:18 PM
Yeah, that sounds impossible in a single argv unless it's to bash -c .... or a custom script.
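One way the `bash -c` approach could be put together, as a sketch only. `wrap_argv` is a hypothetical helper, not a Pants API, and the setup/teardown snippets are whatever shell commands you need. It silences setup and teardown output and exits with the wrapped tool's original exit code.

```python
from typing import List, Sequence


def wrap_argv(
    tool_argv: Sequence[str], setup: str, teardown: str, shell: str = "bash"
) -> List[str]:
    """Build an argv that runs `setup`, then the tool, then `teardown`."""
    script = "\n".join(
        [
            "{\n" + setup + "\n} >/dev/null 2>&1",     # ignore setup output
            '"$@"',                                    # run the real tool
            "rc=$?",                                   # remember its exit code
            "{\n" + teardown + "\n} >/dev/null 2>&1",  # ignore teardown output
            'exit "$rc"',
        ]
    )
    # The argument after the script becomes $0; the tool argv becomes "$@".
    return [shell, "-c", script, "wrapper", *tool_argv]
```

The resulting list could then be passed as the argv of a process. Note that failures in setup or teardown are ignored here, which may or may not be what you want.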
h

happy-kitchen-89482

07/22/2022, 4:59 PM
There has been some discussion of mutable caches, to support the mypy cache
b

bitter-ability-32190

07/22/2022, 4:59 PM
to support the mypy cache
That's actually what I'm looking into 🙂
h

happy-kitchen-89482

07/22/2022, 4:59 PM
So there is at least one prominent use-case for it apart from yours
Oh, hah
b

bitter-ability-32190

07/22/2022, 4:59 PM
Writing my findings on the ticket https://github.com/pantsbuild/pants/issues/10864
h

happy-kitchen-89482

07/22/2022, 5:00 PM
OK, so you have the context of those discussions
🙂
b

bitter-ability-32190

07/22/2022, 5:07 PM
Yeah I think some kind of FileSystem snapshotting would really be ideal for this kind of use case. Moving the onus of work from each tool and into Pants
Semi-related. Given a digest, how "expensive" is it to ask for every file's fingerprint?
a

average-vr-56795

07/22/2022, 5:23 PM
Semi-related. Given a digest, how "expensive" is it to ask for every file's fingerprint?
Pretty much free
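A toy model of why this is cheap, assuming the digest is a Merkle-style tree in which each file node already stores its content hash. This is an illustration, not Pants' implementation; `FileNode`, `DirNode`, `capture` and `fingerprints` are all hypothetical names. Hashing happens once at capture time, and listing fingerprints afterwards is only a tree walk.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Tree = Dict[str, Union[bytes, "Tree"]]


@dataclass(frozen=True)
class FileNode:
    name: str
    fingerprint: str  # sha256 of the content, computed once at capture time


@dataclass(frozen=True)
class DirNode:
    name: str
    files: Tuple[FileNode, ...] = ()
    dirs: Tuple["DirNode", ...] = ()


def capture(name: str, tree: Tree) -> DirNode:
    """Hash every file once and store the result in the tree."""
    files, dirs = [], []
    for child, value in sorted(tree.items()):
        if isinstance(value, dict):
            dirs.append(capture(child, value))
        else:
            files.append(FileNode(child, hashlib.sha256(value).hexdigest()))
    return DirNode(name, tuple(files), tuple(dirs))


def fingerprints(node: DirNode, prefix: str = "") -> Dict[str, str]:
    """Walk the tree and read the stored fingerprints; no file content is touched."""
    out = {prefix + f.name: f.fingerprint for f in node.files}
    for d in node.dirs:
        out.update(fingerprints(d, prefix + d.name + "/"))
    return out
```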
b

bitter-ability-32190

07/22/2022, 5:24 PM
I assumed as much. Although I think the path I'm going down is a dead-end. We'd also need to know every file in our requirements as well and this is starting to become quite the tangled mess
It might be more feasible to add a --pants flag to mypy, akin to --bazel
(going back on my earlier statement, looks like the writing might be atomic: https://github.com/python/mypy/blob/b0e59b29be239ce35cffcef20639709259ee48df/mypy/metastore.py#L101)
w

witty-crayon-22786

07/22/2022, 5:42 PM
the high level qualification for append_only_caches is really: “the cache was designed to be 1. shared by multiple processes, 2. in multiple repos”
b

bitter-ability-32190

07/22/2022, 5:43 PM
Makes sense. I'd like to believe mypy is multi-process safe in its cache
w

witty-crayon-22786

07/22/2022, 5:43 PM
yea, 2. is my primary concern with mypy.
b

bitter-ability-32190

07/22/2022, 5:44 PM
Then just the fact that --bazel hardcodes CWD for the cache dir is basically the only issue
Same concern with 2, but my idea was to make a different subdir in the cache per repo (somehow)
w

witty-crayon-22786

07/22/2022, 5:45 PM
@bitter-ability-32190: my understanding of the --bazel flag (and i haven’t looked at it recently) is that no caches would be necessary, because you could treat mypy like a compiler: files and compilation-results-from-dependencies in, compilation results out, etc
i don’t know of a better name for it, but basically: separate compilation
b

bitter-ability-32190

07/22/2022, 5:46 PM
Those compilation results need to exist somewhere and be re-used for incrementality though, right? I think I'm missing something
w

witty-crayon-22786

07/22/2022, 5:47 PM
if i’m understanding correctly, they’re captured as the outputs of having compiled/checked a file, and then fed into the dependees’ check
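The "separate compilation" flow described here can be sketched as a walk over the dependency graph in topological order, where each check receives only the outputs of its direct dependencies. `check_all` and the `check` callback are hypothetical stand-ins; how mypy would actually consume dependency cache entries is not shown.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def check_all(
    deps: Dict[str, Set[str]],
    check: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Check every file after its dependencies, feeding their results forward.

    `deps` maps a file to the files it depends on. `check(file, dep_results)`
    returns that file's check output (for example, its cache entry).
    """
    results: Dict[str, str] = {}
    # static_order() yields dependencies before the files that depend on them.
    for target in TopologicalSorter(deps).static_order():
        dep_results = {d: results[d] for d in deps.get(target, ())}
        results[target] = check(target, dep_results)
    return results
```

Because each check is a pure function of its inputs, each one can be cached independently, and files with no dependency relationship can be checked in parallel.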
b

bitter-ability-32190

07/22/2022, 5:48 PM
Ohhhhhhh I see. And I think I understand. And I understand why you want to treat mypy like a compiler and check iteratively (recursively but same diff)
w

witty-crayon-22786

07/22/2022, 5:50 PM
yea. it can actually be remotely executed* (can always be remotely cached) without support for append-only-caches on a cluster.
b

bitter-ability-32190

07/22/2022, 5:50 PM
So yes, and no. --bazel basically outputs the same cache as always, just tweaked for sandbox's sake. I think one important thing here is that mypy will output the "artifact" for our file and every file it depends on, including stdlib. So might get very busy for a single "compile". This is hurting my brain though, we're reaching the limit of used brain cells
w

witty-crayon-22786

07/22/2022, 5:52 PM
the digest merging of inputs would be … tricky. but yea, if you merge the caches from multiple inputs, and ignore collisions (?) then you’d end up with all dependencies having cache entries
b

bitter-ability-32190

07/22/2022, 5:52 PM
Yeah. Would there even be collisions?
There'd be duplicates for sure. But collisions?
w

witty-crayon-22786

07/22/2022, 5:52 PM
(currently digest merging fails for non-equal duplicate filenames: but in this case, the stdlib will end up being checked multiple times potentially)
(…possibly all thirdparty code, actually)
b

bitter-ability-32190

07/22/2022, 5:54 PM
Yeah if you're "compiling" and none of your dependencies have "compiled" a 3rdparty or stdlib you'll do the work. Even though the work might've been done before by another file in a similar fashion
w

witty-crayon-22786

07/22/2022, 5:54 PM
right. and that would result in (harmless) collisions in the cache
(harmless: famous last words)
but yea, an option to allow digest merging to succeed in that case would be easy to implement
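A rough model of the merge behaviour under discussion, with digests simplified to `path -> fingerprint` mappings. `merge_entries`, `MergeCollision` and the `allow_collisions` flag are hypothetical; in particular, "first entry wins" is only one possible policy for the proposed option.

```python
from typing import Dict, Iterable


class MergeCollision(Exception):
    """Raised when the same path appears with different content."""


def merge_entries(
    digests: Iterable[Dict[str, str]], allow_collisions: bool = False
) -> Dict[str, str]:
    merged: Dict[str, str] = {}
    for entries in digests:
        for path, fingerprint in entries.items():
            existing = merged.get(path)
            if existing is None or existing == fingerprint:
                # New path, or an exact duplicate: always fine.
                merged[path] = fingerprint
            elif not allow_collisions:
                # Same path, different content: the default is to fail.
                raise MergeCollision(path)
            # With allow_collisions=True the first entry is kept.
    return merged
```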
b

bitter-ability-32190

07/22/2022, 5:55 PM
I think you'll hit cache duplicates all over the place, lol. I think the real "pain" here is duplicate work
w

witty-crayon-22786

07/22/2022, 5:56 PM
my guess is that the (process) cache hitrate and parallelism more than make up for it.
but since mypy wasn’t designed for separate compilation from the start, who knows.
b

bitter-ability-32190

07/22/2022, 5:57 PM
Yeah but that's not cranking it to 11
Alright I think Stu has all the pieces, this has been fun 😛
w

witty-crayon-22786

07/22/2022, 5:59 PM
https://github.com/pantsbuild/pants/issues/10864#issuecomment-999863710 talks about the parallels with jvm compilation: would be happy to help pair on it if you’re interested in tackling!
…as mentioned there, https://github.com/pantsbuild/pants/issues/14070 might be good to do first
b

bitter-ability-32190

07/22/2022, 6:01 PM
I'm wondering a little if we could combine this with a dedicated named cache dir. That could help "bootstrap" the cache for the 3rdparty/stdlib 🤔
w

witty-crayon-22786

07/22/2022, 6:03 PM
not sure how to make mypy safe for concurrent use from multiple repos. i.e. 2 from: https://pantsbuild.slack.com/archives/C0D7TNJHL/p1658511779978859?thread_ts=1658504921.916949&cid=C0D7TNJHL
the cache key is the filename. if you have the same filename with different contents in two copies of the repository, then you’d overwrite it with the wrong content.
…could prefix the location in the named cache with the absolute path of the repo maybe
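The per-repo prefix idea could look something like this. `repo_cache_dir` is a hypothetical helper, and hashing the absolute path (rather than mirroring it as nested directories) is just one choice that keeps the subdirectory name short.

```python
import hashlib
from pathlib import Path


def repo_cache_dir(named_cache_root: Path, repo_root: Path) -> Path:
    """Return a subdirectory of the named cache that is unique per repo checkout."""
    absolute = str(repo_root.resolve())
    key = hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]
    return named_cache_root / key
```

Two checkouts of the same repository at different paths then get separate cache entries even when their filenames match.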
b

bitter-ability-32190

07/22/2022, 6:09 PM
Yeah that's what I suggested. Different subdir per repo
What's interesting too is that mypy has an "interface hash" for each file. Almost makes you wonder if we could leverage that for file invalidation w.r.t. dependencies 🤔
w

witty-crayon-22786

07/22/2022, 6:32 PM
Almost makes you wonder if we could leverage that for file invalidation w.r.t. dependencies 🤔
yea, totally. to be clear though, that should be the effect of feeding a check only the cache and not files themselves for dependencies
i.e. the file content might have changed, but because its cache entry / interface did not, you can hit for dependees
…feeding a check only the cache and not files themselves for dependencies
assuming mypy allows that. 🤞
if --bazel mode doesn’t, it should.
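The invalidation effect described above can be illustrated with a cache key that depends on a file's own source and on its dependencies' interface hashes only, never on their source. `check_key` is hypothetical, and the interface hashes are taken as given; how mypy computes them is not modelled here.

```python
import hashlib
from typing import Dict


def check_key(source: bytes, dep_interface_hashes: Dict[str, str]) -> str:
    """Cache key for checking one file, given its dependencies' interface hashes."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(source).digest())
    # Sort so the key does not depend on dict ordering.
    for dep in sorted(dep_interface_hashes):
        h.update(dep.encode("utf-8"))
        h.update(b"\0")
        h.update(dep_interface_hashes[dep].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()
```

If a dependency's body changes but its interface hash stays the same, the dependee's key is unchanged and its earlier result can be reused.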