# development
b
@witty-crayon-22786 you mentioned having thoughts on remote caching https://github.com/pantsbuild/pants/issues/11149 and specifically concerns about fine-grained cache requests due to individually caching each entry in a directory tree for non-REAPI implementations, like S3 or GHA. I did a quick bit of research into some of the other systems:
• Bazel and Gradle seem to do fine-grained request-per-entry-in-tree in their HTTP caches
• Lage and Rush seem to tar up the entire output of each command, rather than handling trees
• I note that, AIUI, Pants' REAPI client currently does a gRPC request per file (https://github.com/pantsbuild/pants/issues/6990)
What were your thoughts? How best to have this discussion?
w
my only concrete thought was that a middle ground is to use the remexec `Tree` protobuf (which we have some code to serialize/deserialize already). a `Tree` is an entire `Directory`, recursively: it contains a root, and then all of the reachable `Directory`s. files are still digests, referenced from the directories
so the contents of the store would be 1) file blobs, 2) `Tree`s referencing files
cc @fast-nail-55400
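A rough sketch of what that store layout could look like, in Python for illustration only (the `Tree`/`Directory` shapes here are simplified stand-ins for the remexec protobufs, and the digesting scheme is an assumption, not Pants' actual implementation):

```python
import hashlib
import json

def digest(data: bytes) -> str:
    # Content-address blobs by SHA-256, as REAPI does.
    return hashlib.sha256(data).hexdigest()

# The store holds only two kinds of entries:
#   1) file blobs, keyed by the digest of their contents
#   2) serialized Tree messages, keyed by the digest of the serialized Tree
store: dict[str, bytes] = {}

def put_file(content: bytes) -> str:
    d = digest(content)
    store[d] = content
    return d

def put_tree(tree: dict) -> str:
    # Stand-in for serializing a remexec Tree protobuf: the root Directory
    # plus all reachable child Directories, in one blob.
    blob = json.dumps(tree, sort_keys=True).encode()
    d = digest(blob)
    store[d] = blob
    return d

# One output tree containing a/b/c/file.txt.
file_digest = put_file(b"hello")
tree = {
    "root": {"dirs": {"a": {"dirs": {"b": {"dirs": {"c": {
        "files": {"file.txt": file_digest}}}}}}}},
}
tree_digest = put_tree(tree)
print(store.keys())  # exactly two entries: the file blob and the Tree blob
```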
b
ah, I see, so rather than having to recursively/sequentially query `a/` then `a/b/` then `a/b/c/` to finally get `a/b/c/file1.txt`, one could get the whole tree, and then straight away `a/b/c/file1.txt`?
w
right
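To make the request-count difference concrete, here's a hedged sketch, assuming a hypothetical `cache_get(digest)` that does one remote round trip per call (not any actual Pants API):

```python
# Per-directory storage: one remote round trip per directory level,
# each lookup depending on the previous one's result.
def fetch_file_per_directory(cache_get, root_digest, path_parts):
    node = cache_get(root_digest)              # a/
    for part in path_parts[:-1]:
        node = cache_get(node["dirs"][part])   # a/b/, then a/b/c/, ...
    return cache_get(node["files"][path_parts[-1]])

# Tree storage: one request for the whole Tree, after which the file digest
# is already known and can be fetched directly (or in parallel with others).
def fetch_file_via_tree(cache_get, tree_digest, path_parts):
    tree = cache_get(tree_digest)
    node = tree["root"]
    for part in path_parts[:-1]:
        node = node["dirs"][part]              # no remote requests here
    return cache_get(node["files"][path_parts[-1]])
```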
b
Interesting. That certainly seems like it'd help deep trees! I could imagine two routes to implementing it:
1. Refactor to prep for storing trees, and then implement non-REAPI caches using them
2. Implement non-REAPI caches, and then optimise by swapping to storing trees
I'm inclined to 2 as it gets value for users sooner. Although, maybe it'd be too slow(?), make the refactoring harder, and... switching to trees would invalidate all the existing directory cache entries (but these are generally small, I imagine). Thoughts?
w
yea, agreed. unless you’ve implemented both, it won’t be clear how much of an improvement trees are, and in that case, 2 is probably best to do first for the reason you mentioned.
👍 1
b
Just brainstorming/taking it one step further: I wonder if one could then start batching small files in a tree to reduce overhead even more, e.g. "intelligently" group files less than X bytes into batches of up to Y files or Z bytes (for some sensible thresholds), and upload/download them as one blob rather than individually. That seems like it'd be pretty subtle, e.g.
• read and write need to be able to both decide on the right hash for addressing the node
• the reduce-requests vs. reduce-invalidation trade-off might be tricky to balance.
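A hedged sketch of one possible grouping heuristic (the thresholds and the input shape are made up for illustration; nothing here reflects an agreed design):

```python
# Group files smaller than SMALL_FILE_LIMIT bytes into batches of at most
# MAX_BATCH_FILES entries or MAX_BATCH_BYTES total, so each batch can be
# uploaded/downloaded as a single blob addressed by its own digest.
SMALL_FILE_LIMIT = 4096       # "X bytes" -- illustrative threshold only
MAX_BATCH_FILES = 64          # "Y files"
MAX_BATCH_BYTES = 256 * 1024  # "Z bytes"

def plan_batches(files: list[tuple[str, int]]) -> tuple[list[list[str]], list[str]]:
    """files is a list of (path, size_bytes). Returns (batches, individually_stored)."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    individual: list[str] = []
    for path, size in sorted(files):  # stable order so read and write agree on batching
        if size >= SMALL_FILE_LIMIT:
            individual.append(path)
            continue
        if current and (len(current) >= MAX_BATCH_FILES
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches, individual
```

The stable ordering is doing real work here: both the writer and any later reader have to derive the same batches (and therefore the same batch digests) from the same tree, which is the "right hash for addressing the node" problem above.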
f
For REAPI, we almost certainly can do better by sending multiple blob requests in a single BatchReadBlobs RPC. We only send one blob request per BatchReadBlobs currently because that code originally sent a higher-overhead streaming request even for small files, and I just refactored it to switch to the BatchReadBlobs request.
So the natural next refactor would be to merge multiple blob read requests into a single gRPC request
That could give performance benefits without needing to revisit the caching architecture's domain models.
Fine-grained caching is supposed to enable performant processing when only parts of a tree change, whereas if a file changes but we store it as an archive or other "bulk container", then the whole container needs to change.
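A hedged sketch of what merging reads could look like, using a hypothetical `batch_read_blobs(digests)` callable standing in for a single BatchReadBlobs-style RPC (this is not Pants' actual code path), while respecting a per-request size limit:

```python
# Instead of one RPC per blob, coalesce pending digests into as few
# batched requests as the server's size limit allows.
MAX_BATCH_REQUEST_BYTES = 4 * 1024 * 1024  # illustrative limit only

def read_all(batch_read_blobs, digests: list[tuple[str, int]]) -> dict[str, bytes]:
    """digests is a list of (hash, size_bytes); batch_read_blobs takes one such
    list and returns {hash: content} for a single batched request."""
    results: dict[str, bytes] = {}
    batch: list[tuple[str, int]] = []
    batch_bytes = 0
    for h, size in digests:
        if batch and batch_bytes + size > MAX_BATCH_REQUEST_BYTES:
            results.update(batch_read_blobs(batch))
            batch, batch_bytes = [], 0
        batch.append((h, size))
        batch_bytes += size
    if batch:
        results.update(batch_read_blobs(batch))
    return results
```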
b
Yeah, I think that's #6990. Unfortunately, I don't think it helps directly with simpler non-REAPI caches (which are my short-term focus), which, to a first order, only support one blob per request.
f
Why not? The blob's digest could be used as a key into the non-REAPI cache.
When I was at Toolchain, we for a time had a Redis-based object store backing an REAPI cache API. Same principle applies to an HTTP-based cache too.
content-addressable storage != REAPI
although REAPI is an impl of the idea
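For example, a minimal content-addressed client over plain HTTP could look something like this (a sketch only; the URL scheme and use of `urllib` are assumptions, not any particular cache's API):

```python
import hashlib
import urllib.request

class HttpBlobCache:
    """Content-addressed GET/PUT against a simple HTTP cache: the key is
    just the SHA-256 digest of the blob, REAPI or not."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        req = urllib.request.Request(
            f"{self.base_url}/{digest}", data=content, method="PUT"
        )
        urllib.request.urlopen(req)
        return digest

    def get(self, digest: str) -> bytes:
        with urllib.request.urlopen(f"{self.base_url}/{digest}") as resp:
            return resp.read()
```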
b
I'm assuming the only operations are (something equivalent to) `GET {digest}` and `PUT {digest}`, so there's no way to meaningfully benefit from "please get `digest1`, `digest2`, `digest3` all together" other than... running 3 requests `GET {digest1}`, `GET {digest2}`, `GET {digest3}`, in parallel at least, but pants can already do that.
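For completeness, the "in parallel at least" part is straightforward with any thread pool; a sketch assuming the hypothetical `HttpBlobCache` above:

```python
from concurrent.futures import ThreadPoolExecutor

def get_many(cache, digests: list[str]) -> dict[str, bytes]:
    # Issue the GETs concurrently; each is still its own request, so this
    # reduces latency but not the total request count.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(digests, pool.map(cache.get, digests)))
```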
f
The third fundamental operation should be "EXISTS {digest}"
Well if the web server is HTTP/2, multiple requests could be in-flight at the same time.
(REAPI calls that third operation FindMissingBlobs, and it's a core contributor to making the protocol performant, since it lets clients avoid unnecessary writes.)
👍 1
(and indeed exists is better phrased as "FIND_MISSING_BLOBS digest1, digest2, digest3, ...")
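For a plain GET/PUT/HEAD cache, the closest equivalent is probing each digest with a HEAD request; a sketch, again assuming the hypothetical `HttpBlobCache` URL scheme (unlike REAPI's single FindMissingBlobs RPC, this is still one request per digest):

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def find_missing_blobs(base_url: str, digests: list[str]) -> list[str]:
    def exists(digest: str) -> bool:
        req = urllib.request.Request(f"{base_url}/{digest}", method="HEAD")
        try:
            urllib.request.urlopen(req)
            return True
        except urllib.error.HTTPError as e:
            if e.code == 404:
                return False
            raise

    with ThreadPoolExecutor(max_workers=8) as pool:
        return [d for d, found in zip(digests, pool.map(exists, digests)) if not found]
```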
b
yeah, so, my idle musing is about getting some of the benefit of gRPC `BatchUpdateBlobs`/`BatchReadBlobs` for these simple `GET`/`PUT`/`HEAD`(?) caches, to not have to do one request per file when there's a pile of small ones, by batching just those tiny files into slightly larger batches... but definitely not going all the way to "everything in one archive" like Lage/Rush. Similar to working with `Tree`s, this seems like something that can just be left for the future, though.
> (and indeed exists is better phrased as "FIND_MISSING_BLOBS digest1, digest2, digest3, ...")
This is likely also something that doesn't generalise well for non-REAPI caches, e.g. often one might have only `HEAD {digest}` without batching; at least cheaper than a full `GET` or `PUT` though?