# development
b
I'm also trying to understand better what a `Workspace` is. Is this like a hermetic temp directory made for each rule execution, or process invocation? Based on poking around at code, I'm having trouble deciding whether it's just for interacting with the equivalent of `dist/` (i.e. "repo global"), or if it's a sandbox for hermetizing task activity
w
no, it is the git working copy, and intended for producing side effects in the working copy
so it's only accessible from `@goal_rule`s
and yea: putting things in `dist`, but it's also capable of writing files anywhere else in the working copy, so it's used by the `fmt` goal to re-format code, and `tailor` to write updated BUILD files
b
yup, seeing it mostly in those tasks via git grep is what made me skeptical that it was for anything else
So is the idea that accessing "hermetic" outputs is strictly done via `output_directories` and `output_files`?
This leads me to another question: is there a notion of just implicitly capturing all writes to the `Process` chroot? (I'm assuming that all processes are run in some sort of chroot-like temp dir)
Like, setting aside whether it's a good idea, does it make sense to say `output_directories=(".",)`?
w
> is there a notion of just implicitly capturing all writes to the `Process` chroot?
no, there isn’t… this is mostly for conformance with the remoting API
it only allows for literals
and that just propagated from bazel’s API
b
I see, and I assume the intent there is just an optimization to avoid capturing unnecessary output?
w
yea… that and probably not trying to standardize a glob-matching implementation in the gRPC interface
but since output directories are supported, if you want to capture “a bunch of unknown stuff”, writing to (or moving into) an output directory and then capturing that directory will generally work
… i don't know off the top of my head whether `output_directories=(".",)` would work, but i think that it would rarely be what you want, because it would re-capture all of the inputs as well
b
Right, makes sense. Just trying to get a handle on the API limitations and intent
👍 2
Thanks!
w
(mm, and i need to remind myself to try to refer to docs where possible so that we can try to improve them: the docs on `Workspace`: https://www.pantsbuild.org/docs/rules-api-file-system#workspacewrite_digest)
b
Yeah I'd read that, but I wasn't 100% sure at the time if "build root" meant what I thought it did (i.e. what it used to). But it does
👍 1
w
more generally: there are no permanent scratch directories in v2 other than the named_caches
b
And even those it seems need to be treated with care, as though they're ephemeral to a `Process` invocation but might give you performance improvements because of sideband caching
w
all `Process`es run in sandboxes, which you can preserve and inspect by setting `--no-process-execution-cleanup-local-dirs`
b
Oh good to know
w
yea, i’ll double check that that is in the docs
> And even those it seems need to be treated with care, as though they're ephemeral to a `Process` invocation but might give you performance improvements because of sideband caching
yea, exactly. they’re only for well behaved append-only caches.
ok, made some doc edits: added `--no-*-cleanup` to the `Process` page. was already on the debugging page: https://www.pantsbuild.org/docs/rules-api-tips#debugging-look-inside-the-chroot
💯 1
f
> … i don't know off the top of my head whether `output_directories=(".",)` would work, but i think that it would rarely be what you want, because it would re-capture all of the inputs as well
REAPI does support the empty string as an output directory to capture the entire input root, but (a) I'm not sure Pants supports it; and (b) I agree with Stu that it's probably not what you want. I suggest designing how you want jar files to show up in input roots and then having the fetch process generate that directory structure when capturing.
for example, (1) set the cache directory location to a known value in the input root that won't conflict with other usages, `__coursier__/cache`, and (2) then capture specific jar resolutions from the cache as output directories, e.g. `__coursier__/cache/https/repo1.maven.org/maven2/io/circe/circe-generic_2.13/0.12.3/` (or as an output file if that works better). then you'll have a digest for each resolved jar and could construct a specific input root from those digests
(tangentially, I've been thinking about "synthetic" process executions which could help with splitting a single batch process execution into multiple individual "synthetic" process executions. think for example about turning the single result from `coursier fetch foo:1.2.3 bar:5.6.7` into two synthetic actions for `coursier fetch foo:1.2.3` and `coursier fetch bar:5.6.7`)
(given that, in some sense, the execution model prefers creating command-lines for process executions that have already been run and thus are cached)
w
resolves tend to be very resistant to splitting though. does coursier have separate “resolve” vs “fetch” phases…?
yea, it looks like `fetch` is transitive, so you can't split it
b
It does have separate resolve/fetch, so it's possible in principle (you could resolve, parse the result down to individual deps, then fetch individual deps)
👍 1
I'm curious though, how smart is the underlying CAKVS about deduplicating identical files within a given digest? I had assumed that was going on under the hood, so I shouldn't worry too much right off the bat about attempting to manually split up individual jars within a `Digest`
w
It is. So it should be fine to run a monolithic resolve and then capture it as one digest to split it
I think that the primary reason not to do that though would be if you wanted independent cache keys for each download... particularly for coursier, the named_caches should make that less of an issue. BUT the named_caches are an extension to the remote execution protocol which we have not implemented in any servers yet.
b
Yeah, and it also cleanly splits the resolve, which is inherently non-hermetic, from the individual fetches, which should in principle be ~indefinitely cacheable
w
Yep. So potentially worth it.
Even if the resolve/fetch were split, you'd still be able to take advantage of the named caches for each of them independently (The named cache discussion is over here: https://groups.google.com/g/remote-execution-apis/c/mA87-IDNpec/m/Nt9GL1NrBAAJ)
Although, re hermetic/non-hermetic: unless each split-out fetch was URL+expected digest, it still might not be reproducible. Nonetheless.
b
yeah