# development
h
Any thoughts on whether REAPI would consider adding an ignore mechanism to `output_paths`, like `output_paths=("dir/", "!dir/ignore_me")`? A substantial slowdown for our Go support is loading all the downloaded modules. We need to capture all of `pkg/mod` from the Go process, but we would be safe to ignore `pkg/mod/cache`. On my machine, ~20% of the size of my downloaded Go modules is from that folder.
> A substantial slowdown for our Go support is loading all the downloaded modules.
I found in a trace last week that loading the `process.output_digest` from LMDB was taking several seconds. This is causing target generation to be really slow for Go when it's not memoized already.
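To make the proposed `output_paths=("dir/", "!dir/ignore_me")` semantics concrete, here is a minimal Python sketch of gitignore-style filtering where the last matching pattern wins. The function name `select_outputs` and the matching rules are assumptions for illustration only; REAPI does not define this behaviour today.

```python
from typing import Iterable, List


def select_outputs(files: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the files captured by ordered include/exclude patterns.

    Hypothetical semantics: patterns are applied in order and the last
    matching pattern wins. A leading "!" marks an exclude. A pattern matches
    a file if it equals the path or names one of its parent directories.
    """

    def matches(pattern: str, path: str) -> bool:
        prefix = pattern.rstrip("/")
        return path == prefix or path.startswith(prefix + "/")

    ordered = list(patterns)
    selected = []
    for path in files:
        keep = False
        for pattern in ordered:
            if pattern.startswith("!"):
                # An exclude overrides any earlier include that matched.
                if matches(pattern[1:], path):
                    keep = False
            elif matches(pattern, path):
                keep = True
        if keep:
            selected.append(path)
    return selected
```

With patterns `("pkg/mod/", "!pkg/mod/cache")`, everything under `pkg/mod` is captured except the `pkg/mod/cache` subtree.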
w
could you relocate the cache instead?
h
What do you mean? Using `SnapshotSubset` etc? Or changing the `Process` (e.g. argv and env vars) to output differently?
w
fwiw, i do think that our experience with gitignore style excludes has been really positive, so they might be accepted as an extension. but i also haven’t been involved there for a while. maybe @average-vr-56795 has thoughts
👍 1
> What do you mean? Using `SnapshotSubset` etc? Or changing the `Process` (e.g. argv and env vars) to output differently?
the latter. reconfiguring the cache location.
a
I suspect people would look at it a bit funny, but probably be ok with it?
👍 1
h
Do you know if Bazel suffers from large digests slowing them down too? I'm still filing the ticket for it, but it was ~4 seconds to load the digest from LMDB iirc
w
Large directory structures, or large files? High counts of either files or directories tend to be a bottleneck more so than total size
a
Internally not so much - they've put a lot of work into optimising how they store and manipulate structures to minimise that overhead (e.g. one of the core bazel data structures is a "nested set" - a set which is a union of other sets) so that they can pass around and handle references to things like "the files in a go distribution" from a bunch of actions with very low overhead
Their file transfer code between client and server is pretty unoptimised, which causes some mild woe, but only the first time each file is uploaded somewhere, so it amortises pretty well. But if you're the first person to build some go on a cluster, it's not going to be amazing for you
There's actually also a lot of overhead from the fact that the representations are different - converting nested sets (really optimised for "these things are related") into merkle trees (really optimised for "this representation is canonical and consistent") is also surprisingly overhead-y
And there has been discussion about changing the REAPI format to allow a more nested-set-like representation, though it never goes anywhere because there are still optimisations that can be done in the existing implementation
The other big difference for local execution is that actions are all run in symlink forests, not in directories with re-materialised files
(Which probably has similar performance characteristics in practice to the "symlink out to the go toolchain" stuff Tom was designing)
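For readers unfamiliar with the "nested set" idea mentioned above, here is a minimal Python sketch: a set made of direct items plus references to other sets, flattened only on demand. This is an illustration under assumed names (`NestedSet`, `flatten`), not Bazel's actual implementation.

```python
from typing import List, Sequence, Tuple


class NestedSet:
    """An immutable set defined as direct items plus a union of other NestedSets."""

    def __init__(
        self,
        direct: Sequence[str] = (),
        transitive: Sequence["NestedSet"] = (),
    ) -> None:
        self.direct: Tuple[str, ...] = tuple(direct)
        self.transitive: Tuple["NestedSet", ...] = tuple(transitive)

    def flatten(self) -> List[str]:
        """Expand to a deduplicated list, visiting each shared child only once."""
        seen_nodes = set()
        seen_items = set()
        out: List[str] = []
        stack = [self]
        while stack:
            node = stack.pop()
            if id(node) in seen_nodes:
                continue  # A shared subtree is expanded a single time.
            seen_nodes.add(id(node))
            for item in node.direct:
                if item not in seen_items:
                    seen_items.add(item)
                    out.append(item)
            # Reverse so children are visited in their declared order.
            stack.extend(reversed(node.transitive))
        return out
```

Many actions can hold a reference to the same child (say, "the files in a go distribution") at the cost of one pointer each. The expensive step is `flatten`, which is roughly what has to happen when converting to a canonical Merkle tree.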
f
> could you relocate the cache instead?
or maybe run `rm -rf path/to/cache` before returning from the `Process` invocation in which you are capturing that path?
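The "delete the cache before returning" idea can be sketched in Python as a wrapper that runs the command and then prunes a subdirectory before outputs would be captured. The function name `run_then_prune` and its parameters are hypothetical; the thread's actual setup uses a bash script, and this only illustrates the ordering.

```python
import os
import shutil
import subprocess
from typing import Dict, Optional, Sequence


def run_then_prune(
    argv: Sequence[str],
    cwd: str,
    prune_relpath: str,
    env: Optional[Dict[str, str]] = None,
) -> int:
    """Run argv in cwd, then delete cwd/prune_relpath before returning.

    Anything captured from cwd afterwards no longer contains the pruned
    subtree, which stands in for an exclude on the output paths.
    """
    result = subprocess.run(list(argv), cwd=cwd, env=env, check=False)
    # Prune even on failure so a partially written cache is never captured.
    shutil.rmtree(os.path.join(cwd, prune_relpath), ignore_errors=True)
    return result.returncode
```

Pruning inside the same invocation keeps the captured digest smaller without needing any protocol change.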
h
Thank you Daniel! And oh, good idea Tom!! We're already using a bash script to run Go code, so that is really easy to do
w
@average-vr-56795: mm, yea. funny you should mention DepSets… we were looking at that yesterday. we’ve used a similar concept in a few places (TransitiveTarget, CoarsenedTarget most recently), but Eric hit a case yesterday with a recursive structure, and it’s becoming clearer why it is generalized in Bazel
and due (in part) to their docs, i realized the connection to `Directory` only this morning: https://github.com/pantsbuild/pants/issues/13112#issuecomment-962027101
so maybe the answer to https://github.com/pantsbuild/pants/issues/13112 is actually a generic recursive ordered set (maybe in Rust, to ease porting the filesystem operations on #13112)
👍 1