How important is it to the StreamingWorkunitHandler API that Pants #development

How important is it to the StreamingWorkunitHandle...

hundreds-father-404

11/14/2021, 9:45 AM

How important is it to the StreamingWorkunitHandler API that the

artifacts

entry uses

Snapshot

rather than

Digest

hundreds-father-404

11/14/2021, 9:46 AM

Iirc we wanted to distinguish between

Digest

(aka directory digest) and

FileDigest

looking the same, but that seems overkill and I think we worked around that? Storing as a

Snapshot

eagerly means that we're going to have bigger workunits than necessary because we're storing all the file paths eagerly. Vs letting plugins hydrate whichever Digests they actually care about

curved-television-6568

11/14/2021, 9:49 AM

Where can I read up on this api? Any pointer to an example of how it is consumed..?

hundreds-father-404

11/14/2021, 9:49 AM

(My motivation is porting some of our workunits code to Rust post-PyO3, specifically adding the classes

PyWorkunit

PyWorkunitArtifact

, and

PyPolledWorkunits

. We need the GIL and to use async code if we stick with using

Snapshot

when mapping from

Workunit

PyWorkunit

. If we make this change, that code can now be done when the GIL is released, which is excellent for performance)

hundreds-father-404

11/14/2021, 9:52 AM

It's underdocumented on docs site, but https://github.com/pantsbuild/pants/blob/main/src/python/pants/engine/engine_aware.py You can subclass

EngineAwareReturnType

and

EngineAwareParameter

with your Python types, which is how you do all the cool stuff like streaming test results as they come in In addition to Pants's dynamic UI and logging, there is also a

StreamingWorkunitHandler

API that allows plugins to inspect all the "workunits" and do whatever they'd like with it. For example, it's how

--stats-log

works: https://github.com/pantsbuild/pants/blob/main/src/python/pants/goal/stats_aggregator.py

👍 1

💯 1

hundreds-father-404

11/14/2021, 9:54 AM

Vs letting plugins hydrate whichever Digests they actually care about

A good example of why this matters. Both of Pants's

StreamingWorkunitHandler

plugins never even consume

artifacts

! That is, anonymous telemetry and

--stats-log

We're wasting performance eagerly hydrating there + storing the paths in the workunit Now, not sure it's that noticeable. But we do generally know that workunit handling has a perf impact: https://github.com/pantsbuild/pants/pull/13483 And unblocks that PyO3 work

hundreds-father-404

11/16/2021, 9:11 PM

cc @polite-garden-50641, any opinion on the streaming workunit API no longer eagerly expanding

Digest -> Snapshot

for the

artifacts

key? If you care about the file paths in the directory Digest, you would request it from the engine after getting back all the workunits, for only the artifacts you care about. Just like how

FileDigest

works now

polite-garden-50641

11/16/2021, 9:14 PM

I don't think we do it eagerly... but I am not sure... u can look at the toolchain code and see. At any case, I would prefer if this API doesn't change unless we absolutely have a good reason to modify it. This is mostly because it is not trivial to handle those changes in terms of the toolchain plugin and the need to be able to support multiple pants versions with the plugin.

polite-garden-50641

11/16/2021, 9:19 PM

the code is here: https://github.com/toolchainlabs/toolchain/blob/master/src/python/toolchain/pants/buildsense/converter.py (sorry for the private link). if people want to see the source code for this it is currently only available by downloading the package source (https://pypi.org/project/toolchain.pants.plugin/#files) and extracting it, then you can see the source code.

👀 1

hundreds-father-404

11/16/2021, 9:19 PM

I don't think we do it eagerly

It is eager due to Pants. Basically, a

Digest

is super cheap, just a SHA 256. Whereas

Snapshot

is more expensive because it has to copy around all the file paths, and also look those up eagerly by reading from LMDB We can get less overhead by fixing that to be lazy

I would prefer if this API doesn't change

Ack. I think if we do make changes, it'd probably be better to bundle also. The two main changes I'm suggesting are: 1. Change the

artifacts

to use

Digest

instead of

Snapshot

2. Return a

PyWorkunit

class, rather than raw dictionary. You'd have typed fields like

name: str

and

parent_id: str | None

(For #2, our Rust code is much simpler if we take that approach. I'm planning on eagerly converting the class into an untyped dict for now to avoid breaking compat)

❤️ 1

polite-garden-50641

11/16/2021, 9:20 PM

I don't fully understand the context for this. feel free to put something on the calendar so we can sync about this.

👍 1

hundreds-father-404

11/16/2021, 9:22 PM

will wait till you're back from trip, this is low priority. Mostly things I'm noticing could be Nicer™️ while rewriting our Rust <-> Python FFI code to PyO3

curved-television-6568

11/16/2021, 9:24 PM

Perhaps a streaming api version handshake so the consumers can be upgraded gracefully..?

👀 1

polite-garden-50641

11/16/2021, 9:26 PM

the challenge is with importing stuff and types changing... the toolchain plugin imports stuff from pants and when those change and we need to support multiple versions it complicates things..

👍 1

hundreds-father-404

11/16/2021, 9:27 PM

I'll start with the background PR I've been doing to implement #2, the

PyWorkunit

type. It will have no end-user impact because I'll convert it into a dictionary - but I think it will make this discussion more clear

polite-garden-50641

11/16/2021, 9:30 PM

is there a demonstrable upside to doing such a refactor and api change ? will it impact perf ? make things simpler in the pants code base?

hundreds-father-404

11/16/2021, 9:37 PM

For `PyWorkunit`: • a better, typed API for consumers, rather than an untyped dict • simpler code for us For storing

Snapshot

instead of

Digest

, motivation is performance: • Don't store more data than necessary & don't read from LMDB when not necessary • Allows us to batch holding the GIL better, meaning we have better concurrency when processing workunits • simpler code for us

👍 1

💯 1

polite-garden-50641

11/16/2021, 10:07 PM

sgtm!

Open in Slack

Previous Next