# development
w
since we have a few “optimize rust intrinsic X for Snapshots” items coming up right now (and with the caveat that this is “optimizing before having profiled”), i should mention that way back in the day, @average-vr-56795 and i discussed having an in-memory-optimized form of `DirectoryDigest`/`Snapshot`, which would effectively be a `HashMap<PathBuf, (File|Directory)>` to make those kinds of functions easier to write. currently we lazily load from the database, but i think that that has made a lot of things harder to implement efficiently.
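roughly this shape (a hand-wavy sketch only: the `Entry` type and its fields here are made up for illustration, not actual engine types):
```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical in-memory form of a Snapshot: every path in the tree is
/// flattened into a single map, so operations like subsetting or adding a
/// prefix don't need to lazily load anything from the database.
#[derive(Debug)]
enum Entry {
    /// A file, identified here by the hex fingerprint of its content.
    File { fingerprint: String, is_executable: bool },
    /// A directory; its children appear as their own entries in the map.
    Directory,
}

type InMemorySnapshot = HashMap<PathBuf, Entry>;

fn main() {
    let mut snapshot: InMemorySnapshot = HashMap::new();
    snapshot.insert(PathBuf::from("src"), Entry::Directory);
    snapshot.insert(
        PathBuf::from("src/lib.rs"),
        Entry::File { fingerprint: "abc123".to_string(), is_executable: false },
    );

    // A "subset"-style operation becomes a plain filter over the keys.
    let subset: InMemorySnapshot = snapshot
        .into_iter()
        .filter(|(path, _)| path.starts_with("src"))
        .collect();
    println!("{} entries in subset", subset.len());
}
```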
on the other hand, it could totally be something else that is the bottleneck.
(certainly for `materialize_directory`)
cc @enough-analyst-54434 ^
h
The only intrinsic that seems really slow from my profiling is `SnapshotSubset`: https://github.com/pantsbuild/pants/issues/9706 The two other optimizations I did are entirely on the Python side to simply avoid unnecessary work
w
yea, saw that ticket. John had also mentioned yesterday that `materialize_directory` needed some love.
SnapshotSubset would be massively easier to implement with this API i think… and thus likely to be more efficient.
👍 1
…hm. now that i think about it. we basically already have all the pieces of this.
…hm. yea. you could already do this by consuming/filtering/rewriting `Snapshot.path_stats` into a new `Snapshot` via `Snapshot::from_path_stats`…
so… maybe no new API needed. just different conventions
👍 1
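i.e. very roughly (a sketch with simplified stand-ins: the real `PathStat`/`Snapshot` types and `Snapshot::from_path_stats` also involve a store handle, futures, and real glob matching):
```rust
use std::path::PathBuf;

// Simplified stand-ins for the engine types, just to show the shape of
// "filter the existing path_stats, then rebuild a Snapshot from them".
#[derive(Clone, Debug)]
enum PathStat {
    File { path: PathBuf },
    Dir { path: PathBuf },
}

#[derive(Debug)]
struct Snapshot {
    path_stats: Vec<PathStat>,
}

impl Snapshot {
    fn from_path_stats(path_stats: Vec<PathStat>) -> Snapshot {
        Snapshot { path_stats }
    }
}

/// SnapshotSubset as "consume/filter/rewrite the path_stats".
fn snapshot_subset(snapshot: &Snapshot, prefixes: &[&str]) -> Snapshot {
    let filtered = snapshot
        .path_stats
        .iter()
        .filter(|ps| {
            let path = match ps {
                PathStat::File { path } | PathStat::Dir { path } => path,
            };
            // Stand-in for real glob matching: keep paths under any requested prefix.
            prefixes.iter().any(|p| path.starts_with(p))
        })
        .cloned()
        .collect();
    Snapshot::from_path_stats(filtered)
}

fn main() {
    let snapshot = Snapshot::from_path_stats(vec![
        PathStat::Dir { path: PathBuf::from("src") },
        PathStat::File { path: PathBuf::from("src/main.rs") },
        PathStat::File { path: PathBuf::from("README.md") },
    ]);
    println!("{:?}", snapshot_subset(&snapshot, &["src"]));
}
```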
h
Does this impact caching semantics of FS operations when you’re not using pantsd? I believe those are cached in lmdb currently - would they still be?
w
a `Snapshot` is the in-memory representation of a merkle tree of `Digest`s in the database
👍 1
so this is just a different way to implement those methods, which currently “walk” the merkle tree from the database, loading directories and files as they go
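(for reference, that “walk” is roughly this shape; `Digest`, `Directory`, and the store here are heavily simplified stand-ins for the real types:)
```rust
use std::collections::HashMap;

/// Simplified stand-in for the engine's Digest: a fingerprint plus a length.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Digest(String, usize);

/// Simplified stand-in for a stored directory node: files live inline,
/// subdirectories are referenced by the Digest of their own node.
struct Directory {
    files: Vec<String>,
    dirs: Vec<(String, Digest)>,
}

/// Stand-in for the lmdb-backed store: Digest -> directory node.
type Store = HashMap<Digest, Directory>;

/// Walk the merkle tree rooted at `root`, loading each directory node from
/// the store as we go, and collect the full path of every file. This is the
/// lazy, load-as-you-go traversal the current intrinsics do.
fn walk(store: &Store, root: &Digest, prefix: &str, out: &mut Vec<String>) {
    let dir = &store[root];
    for f in &dir.files {
        out.push(format!("{}{}", prefix, f));
    }
    for (name, digest) in &dir.dirs {
        walk(store, digest, &format!("{}{}/", prefix, name), out);
    }
}

fn main() {
    let mut store: Store = HashMap::new();
    let src = Digest("src-digest".into(), 42);
    store.insert(src.clone(), Directory { files: vec!["lib.rs".into()], dirs: vec![] });
    let root = Digest("root-digest".into(), 84);
    store.insert(
        root.clone(),
        Directory { files: vec!["README.md".into()], dirs: vec![("src".into(), src)] },
    );

    let mut paths = Vec::new();
    walk(&store, &root, "", &mut paths);
    println!("{:?}", paths); // ["README.md", "src/lib.rs"]
}
```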
(honestly, i can’t tell you how excited i am that we actually have to optimize this, because that means it’s really being used, heh)
💯 1
h
So, the only thing that we cache in lmdb is `Digest`, then? We don’t cache the result of `await Get[Digest](AddPrefix)`? We will cache the resulting `Digest` in lmdb, but not that this particular `AddPrefix` request -> this particular output digest?
w
it’s important to differentiate “caching” from “storing”
we store the contents of `Snapshot`s (effectively) in the database as a merkle tree (at least in part because it’s the remote exec format)
independently, we cache process executions.
the cache of process executions is tiny: it’s a cache from “digest of all inputs” to “digest of output”
hitting the cache is `Digest` -> `Digest`… you can then grab the output `Digest` (roughly, i’m paraphrasing) and look it up in the store to get a snapshot
(paraphrasing because the inputs and outputs are actually protobuf input/output structs)
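in very rough rust-ish pseudocode (a sketch only: the real cache keys and values are serialized protobuf structs, not these made-up types):
```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct Digest(String);

/// The process cache is conceptually tiny: it only maps the digest of a
/// process's serialized inputs to the digest of its serialized outputs.
struct ProcessCache {
    entries: HashMap<Digest, Digest>,
}

/// The store holds the actual content, addressed by digest, including the
/// output tree that a cached output digest points at.
struct Store {
    blobs: HashMap<Digest, Vec<u8>>,
}

fn hit_cache(cache: &ProcessCache, store: &Store, input_digest: &Digest) -> Option<Vec<u8>> {
    // 1. Hitting the cache is Digest -> Digest ...
    let output_digest = cache.entries.get(input_digest)?;
    // 2. ... and then the output Digest is looked up in the store to get the
    //    actual output (e.g. the root of an output Snapshot).
    store.blobs.get(output_digest).cloned()
}

fn main() {
    let input = Digest("digest-of-all-inputs".into());
    let output = Digest("digest-of-output".into());

    let cache = ProcessCache { entries: HashMap::from([(input.clone(), output.clone())]) };
    let store = Store { blobs: HashMap::from([(output, b"serialized output tree".to_vec())]) };

    println!("{:?}", hit_cache(&cache, &store, &input));
}
```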
@hundreds-father-404: so, to answer your question: those other intrinsics are not cached in the database the way process executions are.
👍 1
they could be if you defined them in terms of protobufs and then serialized them, but they aren’t currently.
on the other hand, all `Node`s are “memoized” in pantsd using the rust equivalent of `eq`/`hash`
👍 1
so pantsd can keep things warm even if we don’t have them defined in terms of protobufs and persisted on disk.
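(conceptually that memoization is just an in-memory map keyed by the `Node`’s `Eq`/`Hash`; here’s a sketch with made-up node and result types:)
```rust
use std::collections::HashMap;

/// Made-up stand-in for an engine Node: anything that is Eq + Hash can be a
/// memoization key, which is the rust equivalent of `__eq__`/`__hash__`.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    AddPrefix { digest: String, prefix: String },
    RemovePrefix { digest: String, prefix: String },
}

/// In-memory memo table: it lives only as long as the pantsd process, so it
/// stays "warm" across runs without anything being serialized to disk.
struct MemoTable {
    results: HashMap<Node, String>,
}

impl MemoTable {
    fn get_or_compute(&mut self, node: Node, compute: impl FnOnce(&Node) -> String) -> String {
        if let Some(result) = self.results.get(&node) {
            return result.clone();
        }
        let result = compute(&node);
        self.results.insert(node, result.clone());
        result
    }
}

fn main() {
    let mut memo = MemoTable { results: HashMap::new() };
    let node = Node::AddPrefix { digest: "abc123".to_string(), prefix: "src".to_string() };

    // The first call computes; an identical Node then hits the in-memory memo.
    let first = memo.get_or_compute(node.clone(), |_| "output-digest-xyz".to_string());
    let second = memo.get_or_compute(node, |_| unreachable!("memoized"));
    assert_eq!(first, second);
    println!("{}", second);
}
```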