# development
b
I'm also trying to understand better what a `Workspace` is. Is this like a hermetic temp directory made for each rule execution, or process invocation? Based on poking around at code, I'm having trouble deciding whether it's just for interacting with the equivalent of `dist/` (i.e. "repo global"), or if it's a sandbox for hermetizing task activity
w
no, it is the git working copy, and intended for producing side effects in the working copy
so it's only accessible from `@goal_rule`s
and yea: putting things in `dist`, but it's also capable of writing files anywhere else in the working copy, so it's used by the `fmt` goal to re-format code, and `tailor` to write updated BUILD files
b
yup, seeing it mostly in those tasks via git grep is what made me skeptical that it was for anything else
So is the idea that accessing "hermetic" outputs is strictly done via `output_directories` and `output_files`?
This leads me to another question: is there a notion of just implicitly capturing all writes to the `Process` chroot? (I'm assuming that all processes are run in some sort of chroot-like temp dir)
Like, setting aside whether it's a good idea, does it make sense to say `output_directories=(".",)`?
w
> is there a notion of just implicitly capturing all writes to the `Process` chroot?
no, there isn’t… this is mostly for conformance with the remoting API
it only allows for literals
and that just propagated from bazel’s API
b
I see, and I assume the intent there is just an optimization to avoid capturing unnecessary output?
w
yea… that and probably not trying to standardize a glob-matching implementation in the gRPC interface
but since output directories are supported, if you want to capture “a bunch of unknown stuff”, writing to (or moving into) an output directory and then capturing that directory will generally work
… i don't know off the top of my head whether `output_directories=(".",)` would work, but i think that it would rarely be what you want, because it would re-capture all of the inputs as well
b
Right, makes sense. Just trying to get a handle on the API limitations and intent
👍 2
Thanks!
w
(mm, and i need to remind myself to try to refer to docs where possible so that we can try to improve them: the docs on `Workspace`: https://www.pantsbuild.org/docs/rules-api-file-system#workspacewrite_digest)
b
Yeah I'd read that, but I wasn't 100% sure at the time if "build root" meant what I thought it did (i.e. what it used to). But it does
👍 1
w
more generally: there are no permanent scratch directories in v2 other than the named_caches
b
And even those it seems need to be treated with care, as though they're ephemeral to a `Process` invocation but might give you performance improvements because of sideband caching
w
all `Process`es run in sandboxes, which you can preserve and inspect by setting `--no-process-execution-cleanup-local-dirs`
b
Oh good to know
w
yea, i’ll double check that that is in the docs
> And even those it seems need to be treated with care, as though they're ephemeral to a `Process` invocation but might give you performance improvements because of sideband caching
yea, exactly. they’re only for well behaved append-only caches.
ok, made some doc edits: added `--no-*-cleanup` to the `Process` page. was already on the debugging page: https://www.pantsbuild.org/docs/rules-api-tips#debugging-look-inside-the-chroot
💯 1
f
> … i don't know off the top of my head whether `output_directories=(".",)` would work, but i think that it would rarely be what you want, because it would re-capture all of the inputs as well
REAPI does support the empty string as an output directory to capture the entire input root, but (a) I'm not sure Pants supports it; and (b) I agree with Stu that it's probably not what you want. I suggest designing how you want jar files to show up in input roots and then having the fetch process generate that directory structure when capturing.
for example, (1) set the cache directory location to a known value in the input root that won't conflict with other usages, `__coursier__/cache`, and (2) then capture specific jar resolutions from the cache as output directories, e.g. `__coursier__/cache/https/repo1.maven.org/maven2/io/circe/circe-generic_2.13/0.12.3/` (or as an output file if that works better). then you'll have a digest for each resolved jar and could construct a specific input root from those digests
(tangentially, I've been thinking about "synthetic" process executions which could help with splitting a single batch process execution into multiple individual "synthetic" process executions. think for example about turning the single result from `coursier fetch foo:1.2.3 bar:5.6.7` into two synthetic actions for `coursier fetch foo:1.2.3` and `coursier fetch bar:5.6.7`)
(given that, in some sense, the execution model prefers creating command-lines for process executions that have already been run and thus are cached)
w
resolves tend to be very resistant to splitting though. does coursier have separate “resolve” vs “fetch” phases…?
yea, it looks like `fetch` is transitive, so you can't split it
b
It does have separate resolve/fetch, so it's possible in principle (you could resolve, parse the result down to individual deps, then fetch individual deps)
👍 1
I'm curious though, how smart is the underlying CAKVS about deduplicating identical files within a given digest? I had assumed that was going on under the hood, so I shouldn't worry too much right off the bat about attempting to manually split up individual jars within a `Digest`
w
It is. So it should be fine to run a monolithic resolve and then capture it as one digest to split it
I think that the primary reason not to do that though would be if you wanted independent cache keys for each download... particularly for coursier, the named_caches should make that less of an issue. BUT the named_caches are an extension to the remote execution protocol which we have not implemented in any servers yet.
b
Yeah, and it also cleanly splits the resolve, which is inherently non-hermetic, from the individual fetches, which should in principle be ~indefinitely cacheable
w
Yep. So potentially worth it.
Even if the resolve/fetch were split, you'd still be able to take advantage of the named caches for each of them independently (The named cache discussion is over here: https://groups.google.com/g/remote-execution-apis/c/mA87-IDNpec/m/Nt9GL1NrBAAJ)
Although, re hermetic/non-hermetic: unless each split-out fetch was URL+expected digest, it still might not be reproducible. Nonetheless.
b
yeah