# general
f
Hello, I'm currently looking into pantsbuild to replace our current build system. The current software is built using a custom "compiler", so I'm using the `adhoc_tool` rule to build. However, the compiler also needs credentials provided through environment variables. These can be provided using `extra_env_vars`, but this parameter also seems to affect the cache, which greatly diminishes the benefit of pantsbuild. Is there any way I can provide credentials through environment variables to `adhoc_tool` without having them affect the cache digest?
w
How are you calling adhoc_tool? Can we see part of the build file? Have you tried messing with the cache_scope? Not sure if that would work - but I wonder if it would separate your results from the inputs a bit https://www.pantsbuild.org/stable/reference/targets/adhoc_tool#cache_scope
f
Hmm, I had a look at `cache_scope`, but I don't see how it would help. If anything, it would do the opposite and make the cache scope smaller.
I made a demo of the issue I am facing:
# Initial run
export AUTH_VAR="VERY_SECRET"
pants export-codegen :main
cat dist/codegen/file.txt
# Gives:
# Authenticated something with credentials: VERY_SECRET

# Now changing the env variable, I'd like the cache to persist, as this
# credential, is just a credential, it doesn't affect the output.
# It either succeeds, or fails, but given it succeeds, the result should be deterministic.
export AUTH_VAR="LESS_SECRET"
pants export-codegen :main
cat dist/codegen/file.txt

# But unfortunately, it now gives:
# Authenticated something with credentials: LESS_SECRET
w
Is it correctly cached if you don't change the extra env var - and keep the same one?
f
Yes
It essentially treats `extra_env_vars` as an input that alters the output, and reruns. If the value stays the same, it reuses the previous cache. In a way this is understandable, since env vars could indeed be considered an input. In my case, though, they aren't, or at least some of them aren't. So it would be nice to be able to configure which env vars actually affect the cache and which don't.
w
Yeah, as these run Pants processes, the behaviour you're seeing is exactly what I would have expected
This is a bit weird, because you're passing the env vars to then get passed to the environment where the compiler runs?
I wonder if there is another way for them to be ambiently available?
f
Environment variables are part of the cache key by design. (The Pants execution model is modeled on the Google-defined Remote Execution API and env vars are in the cache key there.)
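To make the point concrete, here is a toy sketch of a cache key that covers both the command line and the env entries. The `cache_key` function and the `compiler main.src` command are hypothetical, and `cksum` is only a stand-in for a real digest; this is not how Pants or REAPI actually compute the key, it only illustrates why changing `AUTH_VAR` produces a different key.

```shell
# Toy cache key: a checksum over the command line plus the sorted env entries.
# Hypothetical helper for illustration only; cksum stands in for a real digest.
cache_key() {
  # $1 = command line, remaining args = NAME=value env entries
  cmd="$1"
  shift
  { printf 'cmd:%s\n' "$cmd"; printf 'env:%s\n' "$@" | sort; } | cksum | cut -d' ' -f1
}

key_a=$(cache_key "compiler main.src" "AUTH_VAR=VERY_SECRET")
key_b=$(cache_key "compiler main.src" "AUTH_VAR=LESS_SECRET")
key_c=$(cache_key "compiler main.src" "AUTH_VAR=VERY_SECRET")

# Same command and same env give the same key (cache hit).
[ "$key_a" = "$key_c" ] && echo "same env: cache hit"
# Same command with a different credential gives a different key (cache miss).
[ "$key_a" != "$key_b" ] && echo "changed env: cache miss"
```

Under this model the only way to keep the key stable is for the credential never to appear in the hashed inputs at all.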
w
I meant ambiently, like system credentials. However, we strip environments by default. Does `workspace_environment` strip or maintain env vars from the system?
f
It would have the same behavior as the local environment does for env vars.
They'd come from the `Process` the Pants rule is trying to run (as modified by relevant config options).
"Ambient" env vars would still be in the cache key. Ambient availability would have to be something like a `.netrc` file or similar.
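For readers unfamiliar with the format, a minimal sketch of what such an ambient credentials file could look like. The host `artifacts.example.com`, the login `build-bot`, and the file location are made up for illustration, and whether a given custom compiler can read a netrc-style file at all is an assumption.

```shell
# Write a netrc-style file to a hypothetical location (a real one is usually ~/.netrc).
netrc_file="${TMPDIR:-/tmp}/demo.netrc"

cat > "$netrc_file" <<'EOF'
machine artifacts.example.com
login build-bot
password VERY_SECRET
EOF

# Credentials files should not be readable by other users.
chmod 600 "$netrc_file"
```

Tools with netrc support can then pick the credentials up without any environment variable being set; curl, for example, accepts `--netrc-file "$netrc_file"`.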
w
yeah, or wherever else - or the compiler would have to break out of the sandbox/read a .env which can't change as part of an input digest
f
We could modify the local and workspace environments to add in env vars without them having been in `Process`, but that would likely break `remote_environment`.
Maybe not break, but cause more invalidations.
(and add in some "ambient" ones from a separate config)
(not every env var to be configured)
w
Yeah, also could be a bit confusing at least. Rule of least surprise for me falls over - as I would expect to invalidate a cache if an env var passed in gets modified. So yeah, it would have to be something separate
f
So the more general issue here is: do we ever support diverging from REAPI for local executions in a way that would be semantically different cache-wise for REAPI stuff?
w
I was thinking workspace environment, specifically because it bypasses the hermetic sandbox already
@faint-yak-89693 Just a heads up, that you can always make a plugin to reproduce whatever behaviour you would like. I'm not sure offhandedly what other built-in mechanism we have to pass through env vars and not have them take part in caching. But, I'll leave that to someone else. I thought that most/all of our various levels of env vars get merged eventually
f
The cache key is handled in Rust code by the `Process` intrinsic rules, so plugins are not going to change that in Python alone.
w
I meant like, pointing the compiler at a file in the workspace environment, bypassing Process kinda thing
E.g. I’ve had to write a plugin where I needed a file on my system that I refused to stick in a repo, and similarly, I didn't want it to invalidate the cache if it changed. Basically a static system config.
f
That might work. Have a file with `export FOO=BAR` stuff in it, source the file by name, and make sure the file is not a dependency of the relevant target.
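A rough sketch of that idea in plain shell, outside of Pants. The file location and the `echo` stand-in for the compiler are hypothetical; the point is only that the command text stays identical while the secret in the sourced file changes.

```shell
# Hypothetical location for the credentials file; in the workaround it would live
# in the workspace but NOT be listed as a dependency of the target.
creds_file="${TMPDIR:-/tmp}/pants-demo-creds.env"

# The command text names the file but never contains the secret itself.
# The echo is a stand-in for the real compiler invocation.
tool_cmd=". '$creds_file' && echo \"Authenticated with: \$AUTH_VAR\""

# First credential.
printf 'export AUTH_VAR=VERY_SECRET\n' > "$creds_file"
first=$(sh -c "$tool_cmd")

# Rotated credential: the file changes, the command text does not.
printf 'export AUTH_VAR=LESS_SECRET\n' > "$creds_file"
second=$(sh -c "$tool_cmd")

echo "$first"
echo "$second"
rm -f "$creds_file"
```

Here both invocations actually run, so the output differs. In the workaround as described, Pants would see the same command text both times, so rotating the credential would not by itself change the cache key.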
w
Yeah, basically that - my use case was MUCH weirder, but fundamentally could end up with the same result
f
Forgive me for my lack of understanding here, but if the argument for why this would "never" be possible is that it would break hermeticity and wouldn't work with REAPI, wouldn't using a `~/.netrc` file outside the sandbox do the same, just less transparently? Technically, I wouldn't expect something truly sandboxed to be able to access `~/.netrc`?
As I mentioned, I'm in the process of trying out and learning pantsbuild, hoping it'll be a good fit for my repository and use case. REAPI is something I'd like to be able to use in the future, but I haven't gotten around to testing it or truly understanding its workings and limitations. (Perhaps this is entirely incorrect, so please slap me back on track if so 😄 ) But I'd kind of hope/expect that if I provide an env variable from the local environment but tell it to execute remotely, it'd copy those env vars (and credentials), run the job, and give me the results. But if it relies on an undeclared file outside the sandbox such as `~/.netrc`, it wouldn't send that over to the remote execution and would just fail? Or is the expectation that the remote execution environment simply has credentials configured for everything?
f
Well, `.netrc` would not work with REAPI, because a local `.netrc` would not be available to a remote build execution since it is not installed there. And sending a `.netrc` over REAPI means it is part of the input root for a remote execution request and thus part of the cache key.
> but tell it to execute remotely, it'd copy those env vars (and credentials), run and perform the job and give me the results.
And the `Action` and `Command` protobufs in the REAPI which carry those env vars are hashed (indirectly) as part of the cache key.
f
So what is the recommended WoW with credentials? While the env vars will work, I'd rather avoid the situation where the entire cache gets invalidated because a credential expired. I have different types, some which technically last forever, other credentials which only last for a couple of hours.
f
There had been some work to solve a similar issue for Python indices: https://github.com/pantsbuild/pants/pull/21852 and https://github.com/pantsbuild/pants/pull/21853. That work was dropped for business reasons. @ripe-gigabyte-88964 revived the work in a different form with https://github.com/pantsbuild/pants/pull/22370. But those would be for Python indices, and did not contemplate exposing UX in `adhoc_tool` et al. for environment variables which do not contribute to the cache key. As I mentioned before, this would need to be a change in how Pants executes processes. You would need a way to supply environment variables to the local executor without them contributing to the cache key, which is the hash of most of the `Process` dataclass internally in the rules engine. (Note that `description`, for example, is excluded from comparisons.) https://github.com/pantsbuild/pants/blob/4abf65d0b989d8c10d995f204076f4e6e25ae87f/src/python/pants/engine/process.py#L106 And designing this needs to take into account how the engine de-duplicates execution requests and only computes the result once per session and/or stores it in the local cache. So it is not just a simple matter of making the `env` field on `Process` not take part in comparisons.
The workaround suggested by @wide-midnight-78598 is to stash these environment variables in a file in the repo and use an `experimental_workspace_environment` as the `environment` for the execution. You would then source the file into the shell to get them. This sidesteps the cache key issue because the environment variables are never set in the `Process` dataclass for your `adhoc_tool` execution.
And the shell commands being executed would just have the filename being sourced. The shell text is part of the cache key, not the files it is sourcing from the workspace environment (unlike the local environment where Pants would need to know about the input file to put it in the sandbox and thus hash it into the cache key).
Basically "tricking" Pants into pulling in data which it did not incorporate into the cache key. (And if you look at the PRs, you will see a similar trick being performed at the Pants rules level.)
f
I have not had any issues with the Python indices, as it seems pantsbuild allows pip to pick up `~/.netrc`. While the hack you propose may work, I don't see it as feasible: it's a very complicated solution to a problem which I think the build tool itself should support. While running remote execution is not a priority for me today, it is something I would like to use. How does that work? Surely it will not use my local `~/.netrc` file? Does the remote server need to have that file configured? I cannot believe this is a "new problem".
While searching around for a solution, I did stumble upon https://pypi.org/project/pants-backend-secrets/ though, documentation is limited, and my "silly" guess of how to use it, does not seem to yield any working results. Here's my demo, which flat-out does not work: https://github.com/Olindholm/pantsbuild-credential-demo/tree/env_secret Was hoping @gorgeous-winter-99296 could have a look, and point me in the right direction?
g
I am on parental leave and have very little computer time, but I'll try to look at it during naptime after the weekend :)