How important is it to the StreamingWorkunitHandle...
# development
h
How important is it to the StreamingWorkunitHandler API that the
artifacts
entry uses
Snapshot
rather than
Digest
?
Iirc we wanted to distinguish between
Digest
(aka directory digest) and
FileDigest
looking the same, but that seems overkill and I think we worked around that? Storing as a
Snapshot
eagerly means that we're going to have bigger workunits than necessary because we're storing all the file paths eagerly. Vs letting plugins hydrate whichever Digests they actually care about
c
Where can I read up on this api? Any pointer to an example of how it is consumed..?
h
(My motivation is porting some of our workunits code to Rust post-PyO3, specifically adding the classes
PyWorkunit
,
PyWorkunitArtifact
, and
PyPolledWorkunits
. We need the GIL and to use async code if we stick with using
Snapshot
when mapping from
Workunit
to
PyWorkunit
. If we make this change, that code can now be done when the GIL is released, which is excellent for performance)
It's underdocumented on docs site, but https://github.com/pantsbuild/pants/blob/main/src/python/pants/engine/engine_aware.py You can subclass
EngineAwareReturnType
and
EngineAwareParameter
with your Python types, which is how you do all the cool stuff like streaming test results as they come in In addition to Pants's dynamic UI and logging, there is also a
StreamingWorkunitHandler
API that allows plugins to inspect all the "workunits" and do whatever they'd like with it. For example, it's how
--stats-log
works: https://github.com/pantsbuild/pants/blob/main/src/python/pants/goal/stats_aggregator.py
šŸ‘ 1
šŸ’Æ 1
Vs letting plugins hydrate whichever Digests they actually care about
A good example of why this matters. Both of Pants's
StreamingWorkunitHandler
plugins never even consume
artifacts
! That is, anonymous telemetry and
--stats-log
We're wasting performance eagerly hydrating there + storing the paths in the workunit Now, not sure it's that noticeable. But we do generally know that workunit handling has a perf impact: https://github.com/pantsbuild/pants/pull/13483 And unblocks that PyO3 work
cc @polite-garden-50641, any opinion on the streaming workunit API no longer eagerly expanding
Digest -> Snapshot
for the
artifacts
key? If you care about the file paths in the directory Digest, you would request it from the engine after getting back all the workunits, for only the artifacts you care about. Just like how
FileDigest
works now
p
I don't think we do it eagerly... but I am not sure... u can look at the toolchain code and see. At any case, I would prefer if this API doesn't change unless we absolutely have a good reason to modify it. This is mostly because it is not trivial to handle those changes in terms of the toolchain plugin and the need to be able to support multiple pants versions with the plugin.
the code is here: https://github.com/toolchainlabs/toolchain/blob/master/src/python/toolchain/pants/buildsense/converter.py (sorry for the private link). if people want to see the source code for this it is currently only available by downloading the package source (https://pypi.org/project/toolchain.pants.plugin/#files) and extracting it, then you can see the source code.
šŸ‘€ 1
h
I don't think we do it eagerly
It is eager due to Pants. Basically, a
Digest
is super cheap, just a SHA 256. Whereas
Snapshot
is more expensive because it has to copy around all the file paths, and also look those up eagerly by reading from LMDB We can get less overhead by fixing that to be lazy
I would prefer if this API doesn't change
Ack. I think if we do make changes, it'd probably be better to bundle also. The two main changes I'm suggesting are: 1. Change the
artifacts
to use
Digest
instead of
Snapshot
2. Return a
PyWorkunit
class, rather than raw dictionary. You'd have typed fields like
name: str
and
parent_id: str | None
(For #2, our Rust code is much simpler if we take that approach. I'm planning on eagerly converting the class into an untyped dict for now to avoid breaking compat)
ā¤ļø 1
p
I don't fully understand the context for this. feel free to put something on the calendar so we can sync about this.
šŸ‘ 1
h
will wait till you're back from trip, this is low priority. Mostly things I'm noticing could be Nicerā„¢ļø while rewriting our Rust <-> Python FFI code to PyO3
c
Perhaps a streaming api version handshake so the consumers can be upgraded gracefully..?
šŸ‘€ 1
p
the challenge is with importing stuff and types changing... the toolchain plugin imports stuff from pants and when those change and we need to support multiple versions it complicates things..
šŸ‘ 1
h
I'll start with the background PR I've been doing to implement #2, the
PyWorkunit
type. It will have no end-user impact because I'll convert it into a dictionary - but I think it will make this discussion more clear
p
is there a demonstrable upside to doing such a refactor and api change ? will it impact perf ? make things simpler in the pants code base?
h
For `PyWorkunit`: ā€¢ a better, typed API for consumers, rather than an untyped dict ā€¢ simpler code for us For storing
Snapshot
instead of
Digest
, motivation is performance: ā€¢ Don't store more data than necessary & don't read from LMDB when not necessary ā€¢ Allows us to batch holding the GIL better, meaning we have better concurrency when processing workunits ā€¢ simpler code for us
šŸ‘ 1
šŸ’Æ 1
p
sgtm!