# development
b
@witty-crayon-22786 you mentioned having thoughts on remote caching https://github.com/pantsbuild/pants/issues/11149 and specifically concerns about fine-grained cache requests due to individually caching each entry in a directory tree for non-REAPI implementations, like S3 or GHA. I did a quick bit of research into some of the other systems:
• Bazel and Gradle seem to do fine-grained request-per-entry-in-tree in their HTTP caches
• Lage and Rush seem to tar up the entire output of each command, rather than handling trees
• I note that, AIUI, Pants' REAPI client currently does a gRPC request per file (https://github.com/pantsbuild/pants/issues/6990)
What were your thoughts? How best to have this discussion?
w
my only concrete thought was that a middle ground is to use the remexec `Tree` protobuf (which we have some code to serialize/deserialize already). a `Tree` is an entire `Directory`, recursively: it contains a root, and then all of the reachable `Directory`s. files are still digests, referenced from the directories
so the contents of the store would be 1) file blobs, 2) `Tree`s referencing files
cc @fast-nail-55400
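A rough sketch of what that store layout could look like, in Python for illustration only (the `Tree`/`Directory` shapes here are simplified stand-ins for the remexec protobufs, and the digesting scheme is an assumption, not Pants' actual implementation):

```python
import hashlib
import json

def digest(data: bytes) -> str:
    # Content-address blobs by SHA-256, as REAPI does.
    return hashlib.sha256(data).hexdigest()

# The store holds only two kinds of entries:
#   1) file blobs, keyed by the digest of their contents
#   2) serialized Tree messages, keyed by the digest of the serialized Tree
store: dict[str, bytes] = {}

def put_file(content: bytes) -> str:
    d = digest(content)
    store[d] = content
    return d

def put_tree(tree: dict) -> str:
    # Stand-in for serializing a remexec Tree protobuf: the root Directory
    # plus all reachable child Directories, in one blob.
    blob = json.dumps(tree, sort_keys=True).encode()
    d = digest(blob)
    store[d] = blob
    return d

# One output tree containing a/b/c/file.txt.
file_digest = put_file(b"hello")
tree = {
    "root": {"dirs": {"a": {"dirs": {"b": {"dirs": {"c": {
        "files": {"file.txt": file_digest}}}}}}}},
}
tree_digest = put_tree(tree)
print(store.keys())  # exactly two entries: the file blob and the Tree blob
```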
b
ah, I see, so rather than having to recursively/sequentially query `a/` then `a/b/` then `a/b/c/` to finally get `a/b/c/file1.txt`, one could get the whole tree, and then straight away `a/b/c/file1.txt`?
w
right
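To make the request-count difference concrete, here's a hedged sketch, assuming a hypothetical `cache_get(digest)` that does one remote round trip per call (not any actual Pants API):

```python
# Per-directory storage: one remote round trip per directory level,
# each lookup depending on the previous one's result.
def fetch_file_per_directory(cache_get, root_digest, path_parts):
    node = cache_get(root_digest)              # a/
    for part in path_parts[:-1]:
        node = cache_get(node["dirs"][part])   # a/b/, then a/b/c/, ...
    return cache_get(node["files"][path_parts[-1]])

# Tree storage: one request for the whole Tree, after which the file digest
# is already known and can be fetched directly (or in parallel with others).
def fetch_file_via_tree(cache_get, tree_digest, path_parts):
    tree = cache_get(tree_digest)
    node = tree["root"]
    for part in path_parts[:-1]:
        node = node["dirs"][part]              # no remote requests here
    return cache_get(node["files"][path_parts[-1]])
```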
b
Interesting. That certainly seems like it'd help deep trees! I could imagine two routes to implementing it:
1. Refactor to prep for storing trees, and then implement non-REAPI caches using them
2. Implement non-REAPI caches, and then optimise by swapping to storing trees
I'm inclined to 2 as it gets value for users sooner. Although, maybe it'd be too slow(?), make the refactoring harder, and... switching to trees would invalidate all the existing directory cache entries (but these are generally small, I imagine). Thoughts?
w
yea, agreed. unless you’ve implemented both, it won’t be clear how much of an improvement trees are, and in that case, 2 is probably best to do first for the reason you mentioned.
👍 1
b
Just brainstorming/taking it one step further: I wonder if one could then start batching small files in a tree to reduce overhead even more, e.g. "intelligently" group files less than X bytes into batches of up to Y files or Z bytes (for some sensible thresholds), and upload/download them as one blob rather than individually. That seems like it'd be pretty subtle, e.g.
• read and write need to be able to both decide on the right hash for addressing the node
• the reduce-requests vs. reduce-invalidation trade-off might be tricky to balance.
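A hedged sketch of one possible grouping heuristic (the thresholds and the input shape are made up for illustration; nothing here reflects an agreed design):

```python
# Group files smaller than SMALL_FILE_LIMIT bytes into batches of at most
# MAX_BATCH_FILES entries or MAX_BATCH_BYTES total, so each batch can be
# uploaded/downloaded as a single blob addressed by its own digest.
SMALL_FILE_LIMIT = 4096       # "X bytes" -- illustrative threshold only
MAX_BATCH_FILES = 64          # "Y files"
MAX_BATCH_BYTES = 256 * 1024  # "Z bytes"

def plan_batches(files: list[tuple[str, int]]) -> tuple[list[list[str]], list[str]]:
    """files is a list of (path, size_bytes). Returns (batches, individually_stored)."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    individual: list[str] = []
    for path, size in sorted(files):  # stable order so read and write agree on batching
        if size >= SMALL_FILE_LIMIT:
            individual.append(path)
            continue
        if current and (len(current) >= MAX_BATCH_FILES
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches, individual
```

The stable ordering is doing real work here: both the writer and any later reader have to derive the same batches (and therefore the same batch digests) from the same tree, which is the "right hash for addressing the node" problem above.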
f
For REAPI, we almost certainly can do better by sending multiple blob requests in a single BatchReadBlobs RPC. We only send one blob request per BatchReadBlobs currently because that code originally sent a higher-overhead streaming request even for small files, and I just refactored it to switch to the BatchReadBlobs request.
So the natural next refactor would be to merge multiple blob read requests into a single gRPC request
That could give performance benefits without needing to revisit the caching architecture's domain models.
Fine-grained caching is supposed to enable performant processing when only parts of a tree change, whereas if a file changes but we store it as an archive or other "bulk container", then the whole container needs to change.
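A hedged sketch of what merging reads could look like, using a hypothetical `batch_read_blobs(digests)` callable standing in for a single BatchReadBlobs-style RPC (this is not Pants' actual code path), while respecting a per-request size limit:

```python
# Instead of one RPC per blob, coalesce pending digests into as few
# batched requests as the server's size limit allows.
MAX_BATCH_REQUEST_BYTES = 4 * 1024 * 1024  # illustrative limit only

def read_all(batch_read_blobs, digests: list[tuple[str, int]]) -> dict[str, bytes]:
    """digests is a list of (hash, size_bytes); batch_read_blobs takes one such
    list and returns {hash: content} for a single batched request."""
    results: dict[str, bytes] = {}
    batch: list[tuple[str, int]] = []
    batch_bytes = 0
    for h, size in digests:
        if batch and batch_bytes + size > MAX_BATCH_REQUEST_BYTES:
            results.update(batch_read_blobs(batch))
            batch, batch_bytes = [], 0
        batch.append((h, size))
        batch_bytes += size
    if batch:
        results.update(batch_read_blobs(batch))
    return results
```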
b
Yeah, I think that's #6990. Unfortunately, I don't think it helps directly with simpler non-REAPI caches (which are my short-term focus), which, to a first order, only support one blob per request.
f
Why not? The blob's digest could be used as a key into the non-REAPI cache.
When I was at Toolchain, we for a time had a Redis-based object store backing an REAPI cache API. Same principle applies to an HTTP-based cache too.
content-addressable storage != REAPI
although REAPI is an impl of the idea
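For example, a minimal content-addressed client over plain HTTP could look something like this (a sketch only; the URL scheme and use of `urllib` are assumptions, not any particular cache's API):

```python
import hashlib
import urllib.request

class HttpBlobCache:
    """Content-addressed GET/PUT against a simple HTTP cache: the key is
    just the SHA-256 digest of the blob, REAPI or not."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        req = urllib.request.Request(
            f"{self.base_url}/{digest}", data=content, method="PUT"
        )
        urllib.request.urlopen(req)
        return digest

    def get(self, digest: str) -> bytes:
        with urllib.request.urlopen(f"{self.base_url}/{digest}") as resp:
            return resp.read()
```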
b
I'm assuming the only operations are (something equivalent to) `GET {digest}` and `PUT {digest}`, so there's no way to meaningfully benefit from "please get `digest1`, `digest2`, `digest3` all together" other than... running 3 requests `GET {digest1}`, `GET {digest2}`, `GET {digest3}`, in parallel at least, but pants can already do that.
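For completeness, the "in parallel at least" part is straightforward with any thread pool; a sketch assuming the hypothetical `HttpBlobCache` above:

```python
from concurrent.futures import ThreadPoolExecutor

def get_many(cache, digests: list[str]) -> dict[str, bytes]:
    # Issue the GETs concurrently; each is still its own request, so this
    # reduces latency but not the total request count.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(digests, pool.map(cache.get, digests)))
```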
f
The third fundamental operation should be "EXISTS {digest}"
Well if the web server is HTTP/2, multiple requests could be in-flight at the same time.
(REAPI calls that third operation FindMissingBlobs, and it's a core contributor to making the protocol performant, since it lets clients avoid unnecessary writes.)
👍 1
(and indeed exists is better phrased as "FIND_MISSING_BLOBS digest1, digest2, digest3, ...")
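For a plain GET/PUT/HEAD cache, the closest equivalent is probing each digest with a HEAD request; a sketch, again assuming the hypothetical `HttpBlobCache` URL scheme (unlike REAPI's single FindMissingBlobs RPC, this is still one request per digest):

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def find_missing_blobs(base_url: str, digests: list[str]) -> list[str]:
    def exists(digest: str) -> bool:
        req = urllib.request.Request(f"{base_url}/{digest}", method="HEAD")
        try:
            urllib.request.urlopen(req)
            return True
        except urllib.error.HTTPError as e:
            if e.code == 404:
                return False
            raise

    with ThreadPoolExecutor(max_workers=8) as pool:
        return [d for d, found in zip(digests, pool.map(exists, digests)) if not found]
```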
b
yeah, so, my idle musing is about getting some of the benefit of gRPC `BatchUpdateBlobs`/`BatchReadBlobs` for these simple `GET`/`PUT`/`HEAD`(?) caches, to not have to do one request per file when there's a pile of small ones, by batching just those tiny files into slightly larger batches... but definitely not going all the way to "everything in one archive" like Lage/Rush. Similar to working with `Tree`s, this seems like something that can just be left for the future, though.
> (and indeed exists is better phrased as "FIND_MISSING_BLOBS digest1, digest2, digest3, ...")
This is likely also something that doesn't generalise well for non-REAPI caches, e.g. often one might have only `HEAD {digest}` without batching; at least cheaper than a full `GET` or `PUT` though?