# development
w
since we have a few “optimize rust intrinsic X for Snapshots” items coming up right now (and with the caveat that this is “optimizing before having profiled”), i should mention that way back in the day, @average-vr-56795 and i discussed having an in-memory-optimized form of `DirectoryDigest`/`Snapshot`, which would effectively be a `HashMap<PathBuf, (File|Directory)>` to make those kinds of functions easier to write. currently we lazily load from the database, but i think that that has made a lot of things harder to implement efficiently.
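roughly this shape (a hand-wavy sketch only: the `Entry` type and its fields here are made up for illustration, not actual engine types):
```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical in-memory form of a Snapshot: every path in the tree is
/// flattened into a single map, so operations like subsetting or adding a
/// prefix don't need to lazily load anything from the database.
#[derive(Debug)]
enum Entry {
    /// A file, identified here by the hex fingerprint of its content.
    File { fingerprint: String, is_executable: bool },
    /// A directory; its children appear as their own entries in the map.
    Directory,
}

type InMemorySnapshot = HashMap<PathBuf, Entry>;

fn main() {
    let mut snapshot: InMemorySnapshot = HashMap::new();
    snapshot.insert(PathBuf::from("src"), Entry::Directory);
    snapshot.insert(
        PathBuf::from("src/lib.rs"),
        Entry::File { fingerprint: "abc123".to_string(), is_executable: false },
    );

    // A "subset"-style operation becomes a plain filter over the keys.
    let subset: InMemorySnapshot = snapshot
        .into_iter()
        .filter(|(path, _)| path.starts_with("src"))
        .collect();
    println!("{} entries in subset", subset.len());
}
```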
on the other hand, it could totally be something else that is the bottleneck.
(certainly for `materialize_directory`)
cc @enough-analyst-54434 ^
h
The only intrinsic that seems really slow from my profiling is `SnapshotSubset`: https://github.com/pantsbuild/pants/issues/9706 The two other optimizations I did are entirely on the Python side to simply avoid unnecessary work
w
yea, saw that ticket. John had also mentioned yesterday that `materialize_directory` needed some love.
SnapshotSubset would be massively easier to implement with this API i think… and thus likely to be more efficient.
👍 1
…hm. now that i think about it. we basically already have all the pieces of this.
…hm. yea. you could already do this by consuming/filtering/rewriting `Snapshot.path_stats` into a new `Snapshot` via `Snapshot::from_path_stats`…
so… maybe no new API needed. just different conventions
👍 1
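i.e. very roughly (a sketch with simplified stand-ins: the real `PathStat`/`Snapshot` types and `Snapshot::from_path_stats` also involve a store handle, futures, and real glob matching):
```rust
use std::path::PathBuf;

// Simplified stand-ins for the engine types, just to show the shape of
// "filter the existing path_stats, then rebuild a Snapshot from them".
#[derive(Clone, Debug)]
enum PathStat {
    File { path: PathBuf },
    Dir { path: PathBuf },
}

#[derive(Debug)]
struct Snapshot {
    path_stats: Vec<PathStat>,
}

impl Snapshot {
    fn from_path_stats(path_stats: Vec<PathStat>) -> Snapshot {
        Snapshot { path_stats }
    }
}

/// SnapshotSubset as "consume/filter/rewrite the path_stats".
fn snapshot_subset(snapshot: &Snapshot, prefixes: &[&str]) -> Snapshot {
    let filtered = snapshot
        .path_stats
        .iter()
        .filter(|ps| {
            let path = match ps {
                PathStat::File { path } | PathStat::Dir { path } => path,
            };
            // Stand-in for real glob matching: keep paths under any requested prefix.
            prefixes.iter().any(|p| path.starts_with(p))
        })
        .cloned()
        .collect();
    Snapshot::from_path_stats(filtered)
}

fn main() {
    let snapshot = Snapshot::from_path_stats(vec![
        PathStat::Dir { path: PathBuf::from("src") },
        PathStat::File { path: PathBuf::from("src/main.rs") },
        PathStat::File { path: PathBuf::from("README.md") },
    ]);
    println!("{:?}", snapshot_subset(&snapshot, &["src"]));
}
```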
h
Does this impact caching semantics of FS operations when you’re not using pantsd? I believe those are cached in lmdb currently - would they still be?
w
a `Snapshot` is the in-memory representation of a merkle tree of `Digest`s in the database
👍 1
so this is just a different way to implement those methods, which currently “walk” the merkle tree from the database, loading directories and files as they go
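(for reference, that “walk” is roughly this shape; `Digest`, `Directory`, and the store here are heavily simplified stand-ins for the real types:)
```rust
use std::collections::HashMap;

/// Simplified stand-in for the engine's Digest: a fingerprint plus a length.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Digest(String, usize);

/// Simplified stand-in for a stored directory node: files live inline,
/// subdirectories are referenced by the Digest of their own node.
struct Directory {
    files: Vec<String>,
    dirs: Vec<(String, Digest)>,
}

/// Stand-in for the lmdb-backed store: Digest -> directory node.
type Store = HashMap<Digest, Directory>;

/// Walk the merkle tree rooted at `root`, loading each directory node from
/// the store as we go, and collect the full path of every file. This is the
/// lazy, load-as-you-go traversal the current intrinsics do.
fn walk(store: &Store, root: &Digest, prefix: &str, out: &mut Vec<String>) {
    let dir = &store[root];
    for f in &dir.files {
        out.push(format!("{}{}", prefix, f));
    }
    for (name, digest) in &dir.dirs {
        walk(store, digest, &format!("{}{}/", prefix, name), out);
    }
}

fn main() {
    let mut store: Store = HashMap::new();
    let src = Digest("src-digest".into(), 42);
    store.insert(src.clone(), Directory { files: vec!["lib.rs".into()], dirs: vec![] });
    let root = Digest("root-digest".into(), 84);
    store.insert(
        root.clone(),
        Directory { files: vec!["README.md".into()], dirs: vec![("src".into(), src)] },
    );

    let mut paths = Vec::new();
    walk(&store, &root, "", &mut paths);
    println!("{:?}", paths); // ["README.md", "src/lib.rs"]
}
```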
(honestly, i can’t tell you how excited i am that we actually have to optimize this, because that means it’s really being used, heh)
💯 1
h
So, the only thing that we cache in lmdb is `Digest`, then? We don’t cache the result of `await Get[Digest](AddPrefix)`? We will cache the resulting `Digest` in lmdb, but not that this particular `AddPrefix` request -> this particular output digest?
w
it’s important to differentiate “caching” from “storing”
we store the contents of `Snapshot`s (effectively) in the database as a merkle tree (at least in part because it’s the remote exec format)
independently, we cache process executions.
the cache of process executions is tiny: it’s a cache from “digest of all inputs” to “digest of output”
hitting the cache is `Digest` -> `Digest`… you can then grab the output `Digest` (roughly, i’m paraphrasing) and look it up in the store to get a snapshot
(paraphrasing because the inputs and outputs are actually protobuf input/output structs)
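in very rough rust-ish pseudocode (a sketch only: the real cache keys and values are serialized protobuf structs, not these made-up types):
```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Digest(String);

/// The process cache is conceptually tiny: it only maps the digest of a
/// process's serialized inputs to the digest of its serialized outputs.
struct ProcessCache {
    entries: HashMap<Digest, Digest>,
}

/// The store holds the actual content, addressed by digest, including the
/// output tree that a cached output digest points at.
struct Store {
    blobs: HashMap<Digest, Vec<u8>>,
}

fn hit_cache(cache: &ProcessCache, store: &Store, input_digest: &Digest) -> Option<Vec<u8>> {
    // 1. Hitting the cache is Digest -> Digest ...
    let output_digest = cache.entries.get(input_digest)?;
    // 2. ... and then the output Digest is looked up in the store to get the
    //    actual output (e.g. the root of an output Snapshot).
    store.blobs.get(output_digest).cloned()
}

fn main() {
    let input = Digest("digest-of-all-inputs".into());
    let output = Digest("digest-of-output".into());

    let cache = ProcessCache { entries: HashMap::from([(input.clone(), output.clone())]) };
    let store = Store { blobs: HashMap::from([(output, b"serialized output tree".to_vec())]) };

    println!("{:?}", hit_cache(&cache, &store, &input));
}
```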
@hundreds-father-404: so, to answer your question: those other intrinsics are not cached in the database the way process executions are.
👍 1
they could be if you defined them in terms of protobufs and then serialized them, but they aren’t currently.
on the other hand, all `Node`s are “memoized” in pantsd using the rust equivalent of `eq`/`hash`
👍 1
so pantsd can keep things warm even if we don’t have them defined in terms of protobufs and persisted on disk.
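(conceptually that memoization is just an in-memory map keyed by the `Node`’s `Eq`/`Hash`; here’s a sketch with made-up node and result types:)
```rust
use std::collections::HashMap;

/// Made-up stand-in for an engine Node: anything that is Eq + Hash can be a
/// memoization key, which is the rust equivalent of `__eq__`/`__hash__`.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    AddPrefix { digest: String, prefix: String },
    RemovePrefix { digest: String, prefix: String },
}

/// In-memory memo table: it lives only as long as the pantsd process, so it
/// stays "warm" across runs without anything being serialized to disk.
struct MemoTable {
    results: HashMap<Node, String>,
}

impl MemoTable {
    fn get_or_compute(&mut self, node: Node, compute: impl FnOnce(&Node) -> String) -> String {
        if let Some(result) = self.results.get(&node) {
            return result.clone();
        }
        let result = compute(&node);
        self.results.insert(node, result.clone());
        result
    }
}

fn main() {
    let mut memo = MemoTable { results: HashMap::new() };
    let node = Node::AddPrefix { digest: "abc123".to_string(), prefix: "src".to_string() };

    // The first call computes; an identical Node then hits the in-memory memo.
    let first = memo.get_or_compute(node.clone(), |_| "output-digest-xyz".to_string());
    let second = memo.get_or_compute(node, |_| unreachable!("memoized"));
    assert_eq!(first, second);
    println!("{}", second);
}
```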