# development
I finished my PoC for having processes be batched for efficiency but split for the cache. It... works! The code bakes in a lot of assumptions/hacks, and the biggest issue/question is how to handle the one process result vs. many process results (how to split/merge stdout/stderr). But for a PoC it lays a neat foundation. First run:
./pants --no-pantsd -ldebug --stats-log fmt --only=black ::
(I only retrofitted black for PoC) shows
local_cache_requests_uncached: 1048
(vs 14 on
) When I then run
./pants --no-pantsd -ldebug --stats-log fmt --only=black src/python/pants/backend/python/lint/black/rules.py
I don't see any processes being run for
(there are uncached requests/processes for getting the ICs). https://github.com/thejcannon/pants/tree/synthcacheproc
🙌 2
👏 2
CC @witty-crayon-22786 @happy-kitchen-89482 I've had a severely reduced capacity for Pants contributions lately. I've had 2 doctors tell me I should focus on getting more sleep, so there goes my hobby time. Therefore it'll be a PoC for a while, but... pretty neat stuff 😈
💜 1
😴 1
The TL;DR
• Client opts into a new type. For the PoC it's just a tuple of `Process`s, but in the future it would carry more granular info. There are tradeoffs to be made when you opt into the new type.
• The command runner is responsible for taking the batch and querying the cache for each object. It collects the uncached process objects to be merged and run as one (TBD on whether does the merging-and-running. PoC has it in )
• To merge we basically just combine the input digests into one and chain all the files to append to the "core" argv
• If the process was successful, store the individual process runs in the cache (TBD on the output info)
• (TBD: collate the uncached batch run + cached results into a final result object)
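The steps above can be sketched roughly as follows. This is a simplified Python model, not the PoC's Rust code: `Process`, `Result`, `merge`, and `run_batch` are hypothetical stand-ins, the cache is a plain dict, and it assumes every process in a batch shares the same "core" argv.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Process:
    """Hypothetical stand-in for a process: a core argv plus the files it runs on."""
    argv: Tuple[str, ...]
    files: Tuple[str, ...]


@dataclass(frozen=True)
class Result:
    exit_code: int
    stdout: str


def merge(processes: List[Process]) -> Process:
    """Combine processes into one: keep the core argv, chain all the files."""
    core_argv = processes[0].argv  # assumption: the whole batch shares one argv
    files = tuple(f for p in processes for f in p.files)
    return Process(core_argv, files)


def run_batch(
    processes: List[Process],
    cache: Dict[Process, Result],
    run: Callable[[Process], Result],
) -> Optional[Result]:
    """Query the cache per process, run the misses as one, cache only on success."""
    uncached = [p for p in processes if p not in cache]
    if not uncached:
        return None  # every process was a cache hit, so nothing runs
    result = run(merge(uncached))
    if result.exit_code == 0:
        for p in uncached:
            # PoC-style value: each entry is just a copy of the batched result.
            cache[p] = result
    return result
```

Collating the fresh result with the cached ones is left out here, since that part is still TBD.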
👍 1
I've had 2 doctors tell me I should focus on getting more sleep, so there goes my hobby time.
please do take care of yourself!! I've had to practice this a lot too this past month. Programming is a particularly addicting hobby w/ the sense of accomplishment 😱
Sleep is glorious and I can highly recommend it 🙂
this PoC is glorious too, but not worth sacrificing your health for
Well, the culprit is my son, whom neither I nor my doctors are able to convince to sleep better 🙂
On topic tho: I think one way to handle stdout/stderr is to write a function that is passed down and called in Rust (if possible) to split/collate the output. It's brittle though, because it depends on both the tool's output decisions and its verbosity level.
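A minimal sketch of what such a splitting function could look like, written in Python for illustration. It assumes a hypothetical tool that prefixes each diagnostic line with `path:`; the function name and the output format are assumptions, not anything the PoC defines.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def split_by_path_prefix(stdout: str, files: Set[str]) -> Dict[Optional[str], str]:
    """Attribute each output line to the file named before the first colon.

    Lines that cannot be attributed to a known file land under the key None.
    """
    per_file: Dict[Optional[str], List[str]] = defaultdict(list)
    for line in stdout.splitlines():
        path, sep, _rest = line.partition(":")
        if sep and path in files:
            per_file[path].append(line)
        else:
            per_file[None].append(line)  # summary lines, banners, etc.
    return {key: "\n".join(lines) for key, lines in per_file.items()}
```

The `None` bucket is where the brittleness shows up: any summary line or change in verbosity produces output that belongs to no single file.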
Can you clarify the "store the individual process runs in the cache" part? How do you split a result of a merged run into individual per-file "processes" ?
The key part is already taken care of, since (in the PoC) we have the process objects (in the future I think we'd synthesize the processes from base info + per-process info). The value is... well, for the PoC it's hand-wavy
Here's the Rust side for caching: https://github.com/thejcannon/pants/blob/57a324d90ef4cf4f05f8c27e207e00ffce00d6eb/src/rust/engine/process_execution/src/cache.rs#L128 Specifically the cache key is the process, the value is just a copy of the output from the run-with-all-files run 🙈
The strategy of "batch together" instead of "split apart" ensures I don't have to split inputs, just outputs.
by "we have the process objects" I take it you mean the process inputs, but what do you put in the process result that is cached? Just a fake exit code of 0 and no other outputs?
• Only cache if exit code is 0
• Output info is a copy of the batched process's output (for the PoC; it would need to be smarter for an actual solution)
sorry for taking so long to look at this!
the rough shape looks reasonable… the biggest question around the whole thing is just whether the user-space /
API can be simple enough to make it worthwhile, including making splitting of outputs simple… i don’t know of any linters with enough JSON output to split safely, but our built-in tools like dependency extraction could probably
fwiw, we had a splitting/merging strategy for a tool in v1, and it was the biggest source of bugs in the whole system (admittedly, it was being used on a compiler, where inter-dependencies are the norm, but)
“if all of these files fail/succeed together then they will fail/succeed independently” is not a guarantee (processes are not necessarily associative), so even caching the error code and doing no splitting will require caution
so… if there are enough usecases that can actually live within those constraints and gain some benefit, then maybe.
but as pointed out on the batching-inference ticket,
is probably not one of the ones where this is the case… tests can definitely have side effects on one another, so it would need to be disabled by default.
(not to mention the fact that it would be a huge refactor of the
for my part, i will continue to focus on lowering per-process overheads, because there continues to be low hanging fruit there, and fixing it allows
code to be written in a readable and cache-friendly way
(for example: getting
stable and used for PEXes would drop a lot of input overhead)
Yeah, I suspect this gets opted into per-tool and by the user (we shouldn't force them into this; as Benjy likes to say, we can provide a "slider"). Then we choose the tools. I think it's really things where output on success doesn't matter much and we're sure the files don't affect each other: fmt/lint/check. Because we can punt on splitting output if we only cache success, and then just toss the output out the window (we already kinda toss it for formatters).
A big one for this might be dep inference
Instead of running 1000 processes to infer on 1000 files, run, say, 10
And since we control the output, we can make it really easy to split
Well I think v1 will have us ignoring output, since it's safe and easy. But yeah hopefully over time we can loosen that and dep inference can be batched and per-file-cached
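For the dep inference idea, a rough Python sketch of the two halves: chunking the files into a handful of batches, and splitting a batch's output back into per-file entries. It assumes the inference script prints a JSON object mapping each file path to its list of imports; that format and both function names are hypothetical.

```python
import json
from typing import Dict, List


def chunk(files: List[str], n_batches: int) -> List[List[str]]:
    """Split files into at most n_batches contiguous, roughly equal batches."""
    size = max(1, -(-len(files) // n_batches))  # ceiling division, at least 1
    return [files[i:i + size] for i in range(0, len(files), size)]


def split_batch_output(stdout: str) -> Dict[str, List[str]]:
    """Parse one batch's stdout into per-file results, ready to cache per file.

    Assumes stdout is a JSON object of the form {"path": ["import", ...]}.
    """
    return {path: list(imports) for path, imports in json.loads(stdout).items()}
```

Because the output is keyed by file, each entry can be stored under its own per-file cache key without any guessing about which line belongs to which file.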