hundreds-father-404 (11/14/2021, 9:45 AM):
artifacts entry uses Snapshot rather than Digest?

hundreds-father-404 (11/14/2021, 9:46 AM):
Digest (aka directory digest) and FileDigest looking the same, but that seems overkill and I think we worked around that? Storing as a Snapshot eagerly means that we're going to have bigger workunits than necessary, because we're storing all the file paths eagerly. Vs. letting plugins hydrate whichever Digests they actually care about.
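A minimal sketch of the size difference being discussed, using simplified stand-in classes (these are illustrative, not Pants's actual implementations): a digest is essentially just a content hash plus a length, while a snapshot also carries every file path in the directory tree.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the two types: a Digest is just a content
# hash plus a size, while a Snapshot also carries every file path.
@dataclass(frozen=True)
class Digest:
    fingerprint: str              # e.g. a SHA-256 hex string
    serialized_bytes_length: int

@dataclass(frozen=True)
class Snapshot:
    digest: Digest
    files: tuple[str, ...]        # every file path, copied eagerly
    dirs: tuple[str, ...]

digest = Digest("ab12cd34", 4096)
snapshot = Snapshot(digest, files=("src/a.py", "src/b.py"), dirs=("src",))

# Storing only the Digest per workunit keeps the payload a fixed, tiny
# size; a plugin that actually wants the paths can hydrate on demand.
assert len(snapshot.files) == 2
```

The point of the proposal is that most consumers only ever need the digest, so paying for the path lists on every workunit is wasted work.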
curved-television-6568 (11/14/2021, 9:49 AM)

hundreds-father-404 (11/14/2021, 9:49 AM):
PyWorkunit, PyWorkunitArtifact, and PyPolledWorkunits. We need the GIL and to use async code if we stick with using Snapshot when mapping from Workunit to PyWorkunit. If we make this change, that code can now be done when the GIL is released, which is excellent for performance.
hundreds-father-404 (11/14/2021, 9:52 AM):
EngineAwareReturnType and EngineAwareParameter with your Python types, which is how you do all the cool stuff like streaming test results as they come in.
In addition to Pants's dynamic UI and logging, there is also a StreamingWorkunitHandler API that allows plugins to inspect all the "workunits" and do whatever they'd like with them. For example, it's how --stats-log works: https://github.com/pantsbuild/pants/blob/main/src/python/pants/goal/stats_aggregator.py
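To make the shape of that API concrete, here is a toy consumer in the spirit of a StreamingWorkunitHandler plugin. The class and method names here are assumptions for illustration only, not Pants's exact signatures; it just shows a plugin receiving a stream of workunits (as dicts) and aggregating stats the way a --stats-log style plugin might.

```python
from collections import Counter

# Hypothetical sketch of a workunit-stream consumer (names are
# illustrative, not the real StreamingWorkunitHandler interface).
class StatsAggregatingHandler:
    def __init__(self):
        self.counts = Counter()

    def on_completed_workunits(self, workunits):
        # Inspect each finished workunit and tally occurrences by name.
        for wu in workunits:
            self.counts[wu["name"]] += 1

handler = StatsAggregatingHandler()
handler.on_completed_workunits(
    [{"name": "run_tests"}, {"name": "run_tests"}, {"name": "lint"}]
)
assert handler.counts["run_tests"] == 2
```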
hundreds-father-404 (11/14/2021, 9:54 AM):
> Vs. letting plugins hydrate whichever Digests they actually care about
A good example of why this matters: both of Pants's StreamingWorkunitHandler plugins (anonymous telemetry and --stats-log) never even consume artifacts! We're wasting performance there by eagerly hydrating and storing the paths in the workunit.
Now, I'm not sure it's that noticeable. But we do generally know that workunit handling has a perf impact: https://github.com/pantsbuild/pants/pull/13483
And it unblocks that PyO3 work.
hundreds-father-404 (11/16/2021, 9:11 PM):
Snapshot -> Digest for the artifacts key? If you care about the file paths in the directory Digest, you would request it from the engine after getting back all the workunits, for only the artifacts you care about. Just like how FileDigest works now.
polite-garden-50641 (11/16/2021, 9:14 PM)

polite-garden-50641 (11/16/2021, 9:19 PM)

hundreds-father-404 (11/16/2021, 9:19 PM):
> I don't think we do it eagerly
It is eager due to Pants. Basically, a Digest is super cheap: just a SHA-256. Whereas a Snapshot is more expensive because it has to copy around all the file paths, and also look those up eagerly by reading from LMDB.
We can get less overhead by fixing that to be lazy.
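A small sketch of the lazy approach being argued for here (all names hypothetical): workunits carry only a cheap digest key, and a plugin asks a store for file paths only for the artifacts it actually cares about, instead of every workunit being hydrated up front.

```python
# Hypothetical in-memory stand-in for an LMDB-backed store.
class FakeStore:
    def __init__(self, paths_by_digest):
        self._paths = paths_by_digest
        self.lookups = 0  # track how many "expensive" reads happen

    def load_paths(self, digest):
        self.lookups += 1
        return self._paths[digest]

store = FakeStore({"sha256:ab": ["out/report.xml"],
                   "sha256:cd": ["out/log.txt"]})
workunits = [{"name": "test", "artifact": "sha256:ab"},
             {"name": "lint", "artifact": "sha256:cd"}]

# A plugin that only cares about the "test" artifact triggers a single
# read, rather than hydrating paths for every workunit eagerly.
wanted = [store.load_paths(wu["artifact"])
          for wu in workunits if wu["name"] == "test"]
assert store.lookups == 1
```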
> I would prefer if this API doesn't change
Ack. I think if we do make changes, it'd probably be better to bundle them also. The two main changes I'm suggesting are:
1. Change the artifacts to use Digest instead of Snapshot.
2. Return a PyWorkunit class rather than a raw dictionary. You'd have typed fields like name: str and parent_id: str | None.
(For #2, our Rust code is much simpler if we take that approach. I'm planning on eagerly converting the class into an untyped dict for now to avoid breaking compat.)
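A minimal sketch of what change #2 could look like on the Python side (field names taken from the message above; the class shape itself is an assumption): a typed PyWorkunit that is eagerly converted into a plain dict, so existing plugins that expect dictionaries keep working.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical sketch of the proposed typed workunit class.
@dataclass(frozen=True)
class PyWorkunit:
    name: str
    parent_id: Optional[str]

wu = PyWorkunit(name="run_tests", parent_id="abc123")

# Eager conversion to an untyped dict, preserving compat with plugins
# that consume the current dictionary-based API.
legacy = asdict(wu)
assert legacy == {"name": "run_tests", "parent_id": "abc123"}
```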
polite-garden-50641 (11/16/2021, 9:20 PM)

hundreds-father-404 (11/16/2021, 9:22 PM)

curved-television-6568 (11/16/2021, 9:24 PM)

polite-garden-50641 (11/16/2021, 9:26 PM)

hundreds-father-404 (11/16/2021, 9:27 PM):
PyWorkunit type. It will have no end-user impact because I'll convert it into a dictionary, but I think it will make this discussion more clear.
polite-garden-50641 (11/16/2021, 9:30 PM)

hundreds-father-404 (11/16/2021, 9:37 PM):
Re: Snapshot instead of Digest, the motivation is performance:
• Don't store more data than necessary, and don't read from LMDB when not necessary.
• Allows us to batch holding the GIL better, meaning we have better concurrency when processing workunits.
• Simpler code for us.
polite-garden-50641 (11/16/2021, 10:07 PM)