I'm finding `terraform init` may not cache stably ...
# general
f
I'm finding `terraform init` may not cache stably on 2.23. It can pull a Terraform lockfile from cache whose checksums don't match what a fresh `terraform init` would produce, which means `terraform validate` can also fail when that step is pulled from cache but succeed without the cache. I'm wondering if the lockfile isn't being considered in how that step is cached.
I don't have a great repro, but in my workspace I can see:
• local and remote cache disabled: the lockfile with the correct hash is present in the sandbox and `terraform validate` succeeds
• with cache: the lockfile has the wrong hash (for the providers in the lockfile) in the sandbox and so it fails
• touching the Terraform source in any way (e.g. adding a new line): the bad lockfile is not pulled from cache and it succeeds
I think it's the cache hits on Terraform modules that aren't covering enough scope
h
And this is new in 2.23?
f
Yes, it doesn't repro on 2.22. I noticed it while upgrading from 2.22, which is what wrote the remote cache I then tested against, so I'm not entirely sure if it's some transitional thing or a regression
But I saw there was some work specifically on Terraform in 2.23, so maybe that could be related
h
cc @careful-address-89803 Presumably related to https://github.com/pantsbuild/pants/pull/21221 ?
c
I implemented Terraform caching in that MR using Terraform's ability to locally cache downloaded modules and a named cache. If I understand the steps to reproduce:
1. Run Pants in a way that `init`s Terraform
2. Update the lockfile but not the sources
It sounds like the lockfile isn't used when computing the cache key for the rule that runs `terraform init`, so it returns the digest with the old providers. Lemme dig into that
That looks like the most likely cause. `get_terraform_providers`, which actually dispatches the `terraform init` invocation, has the lockfile in the request (and therefore in the cache key). But most requests are for `terraform_init`, which only has the root module and dependencies in its cache key. I wonder if marking `terraform_init` as non-cacheable is the only real solution, since the lockfile is only looked up and fetched in that rule. Let me try to make a reproducer and see.
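Roughly the shape I mean (paraphrasing from memory; these names are hypothetical, not the real Pants types):
```python
# Illustrative only: in Pants, what ends up in the request is what ends up in the
# cache key, so a lockfile that isn't captured here can't invalidate a cached init.
from dataclasses import dataclass

@dataclass(frozen=True)
class InitLikeRequest:          # ~ terraform_init: lockfile not part of the key
    root_module: str
    dependencies_digest: str

@dataclass(frozen=True)
class ProvidersLikeRequest:     # ~ get_terraform_providers: lockfile is in the key
    root_module: str
    dependencies_digest: str
    lockfile_digest: str
```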
I can't get this to repro; Pants keeps rerunning `terraform init`. Can you give me more details on how you're updating the lockfile? I only have a local cache; can you still repro this if you disable the remote cache? I can also make a few fixes for you to try out, but that sounds like a lot of work for both of us
f
Yeah I can repro with remote cache disabled, I'll see if I can get some more specifics for a repro to share
Ah, I think I understand the difference with the repro: `pants check --only terraform-validate` does not cause a `terraform init` and instead pulls that step from cache unless sources have changed. It then runs `terraform validate` without having the right providers. If I change the source (e.g. add a line break), it triggers a `terraform init` before `terraform validate` and then it is fine.
So, as I think you guessed, the cache key of init seems insufficient
I don't know if cache written by 2.22 and then consumed by 2.23 could be a factor
@careful-address-89803 any ideas whether the 2.22-to-2.23 transition could matter cache-wise? It's not making complete sense to me yet: it fails unless init runs, but it also doesn't init consistently, and it's not obvious where that forks the wrong way
c
I'm honestly not sure. I've only done a little work with caching, not enough to say definitively. I would think the Pants version shouldn't affect the cache, because the rule's input is the same. It makes sense to me that not reinitialising would result in failure, but I'm not sure why it would init only sporadically. Can you give me more details on how you're modifying the lockfile? I tried just `mv`ing a new version in, but Pants seems to reinitialise. Also, if the problem is bad entries in the cache, you can try using a different cache dir temporarily and see if you can still get a repro (a sketch of that below). Or you can wipe the cache; if this is caused by bad entries or by switching to a new Pants version, they'll be purged.
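Something like this should point a single run at fresh cache dirs (flag names from the global options; paths are just examples):
```sh
# one-off run against throwaway local caches
pants --local-store-dir=/tmp/pants-lmdb --named-caches-dir=/tmp/pants-named-caches \
  check --only=terraform-validate ::
```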
f
I think the issue is that we don't have stable lockfiles. That would explain it grabbing some other version of the providers that looks okay to Pants but then breaks on checksums in Terraform. I'll see if that is why it's happening here
Especially if source files give approximate version requirements
So the cache lookups based on sources are doing their best, but there's no lockfile to confirm against. Ideally that should fail with "can't determine which providers to pull in" rather than use whatever is in cache and then fail on a checksum at the Terraform level; that would reduce confusion, I think
@careful-address-89803 above ☝️
It's interesting in the sense that maybe if I ask for version 2.x of some provider, it breaks against 2.7 if I had 2.9 locally, or some similar scenario
There must be a condition for when inits happen, and that condition is not properly hermetic
That's the only thing I can think of; it's difficult to narrow down to a specific case. From my repros it's not stable across 2.22 and 2.23, but killing the cache and going from 2.23 is fine. The question is when it will happen again,
since wiping the cache can be painful for large-scale applications of Pants
c
When you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git? Do you have them ignored in .gitignore or pants_ignore? Do you manually add the lockfile as a `file` or `resource` dependency? One of the changes from 2.22 to 2.23 was that lockfiles went from being a file pulled in with `PathGlobs` to a synthetic target. As a target, it propagates its changed status and is pulled in as a dependency. I'm not sure if that would have an impact, though; if anything, I would expect that to fix it.
Also to clarify: it's not that Pants is pulling in a cached version of the lockfile, it's that Pants is pulling in a cached result of initialising Terraform. Pants's behaviour for `terraform_module`s with no lockfile is to defer to `terraform init`, which allows running without a lockfile. That generates a lockfile which will be used later for `terraform validate`, but it will not be written to the workspace.
Also, when you say "if I ask for version 2.x of some provider it breaks against 2.7 if I had 2.9 locally", what do you mean? Do you mean that if your lockfile has `version = "2.9"`, the error says that Terraform has the provider files for version 2.7? Or something else?
I also still can't get a reproduction. Here's my setup:
pants.toml
```toml
[GLOBAL]
#pants_version = "2.22.0"
pants_version = "2.23.0"

backend_packages.add = [
      "pants.backend.experimental.terraform",
      "pants.backend.python",
]

[python]
interpreter_constraints = ["==3.9.*"]
```
BUILD
```python
terraform_module(name="r")
```
main.tf
```hcl
terraform {
  backend "local" {
    path = "/tmp/will/not/exist"
  }
  required_providers {
    null = {
      source  = "hashicorp/null"
      version = "~>3.2.0"
    }
  }
}
```
I have versions of the `.terraform.lock.hcl` with the hashes for either 3.2.0 or 3.2.2. Here are my steps:
1. clear the Pants cache
2. `cp .terraform.lock.old .terraform.lock.hcl`
3. set the pants version to 2.22
4. `pants check --only=terraform-validate ::`
5. kill pantsd
6. `cp .terraform.lock.new .terraform.lock.hcl`
7. set the pants version to 2.23
8. `pants check --only=terraform-validate ::`
It reruns `terraform init` every time. If I increase the provider version (eg to `~>3.2.3`) and then try rolling back to the old lockfile, it tries to reinitialise and fails (because the lockfile doesn't have any providers that match). Does this demo look roughly like your situation? Can you spot a key difference, besides size?
I think I didn't read this closely enough: is this the error you're getting (for your module/provider)?
`Error: registry.terraform.io/hashicorp/azurerm: the cached package for registry.terraform.io/hashicorp/azurerm 4.14.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`
I thought the error sounded like a provider version mismatch, not this. If that's the error you're getting, I have a case of it too. It's happening with one of my modules that doesn't have a lockfile checked in to git (sounds like your case). Poking around the sandbox, the provider itself has the correct sha256sum, but the lockfile has an incorrect hash (and only the H1 is different)! Regenerating the lockfile in the sandbox gives the correct hash (TF adds it instead of replacing).
I have a feeling it's because of this note on the provider plugin cache: https://developer.hashicorp.com/terraform/cli/config/config-file#provider-plugin-cache
> Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.
Which is exactly what Pants does with `check ::`.
f
Sorry for the late reply
> when you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git
Yeah exactly, they're ignored in .gitignore. The error is indeed:
`Error: registry.terraform.io/grafana/grafana: the cached package for registry.terraform.io/grafana/grafana 2.9.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`
It sounds like you're onto something. Thanks for the PR, I can try the branch out if you want
@careful-address-89803 I experimented a bit with your branch for that PR on the same workspace where I originally ran into this. Unfortunately it didn't fix the error: even without lockfiles it still has trouble, which I think is what you linked about concurrent `terraform init` having undefined behaviour. (The project I'm testing on has ~30 modules.) It's fine when I manually init each module, but via Pants it's just not stable. It does seem fine if, via Pants, I make sure only one module gets inited at a time, so it looks to me like the concurrency aspect of `terraform init`. I think I managed to get it stable for my environment across all modules by just committing the lockfiles (as one should anyway). However, I can't find a way to include hashes for other platforms, e.g. generating them on macOS but including amd64 hashes so they're viable in CI/deployments (like you would with `terraform providers lock -platform=linux_amd64`), but that's of course a separate topic. So in summary, I'm honestly not sure where it really goes wrong, but with lockfiles it acts more stable; without them it inconsistently breaks hashes quite "randomly", such as one module getting the wrong hash for a provider that another module successfully got from cache for the same version during the same Pants command that caused both of them to init.
c
Thanks so much for trying it! I missed lockfile generation; it was still using the cache 🤦. I've pushed another commit that should skip the cache when generating lockfiles. Sorry about that! You should be able to generate cross-platform lockfiles with the `[download-terraform].platforms` setting.
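Something roughly like this in pants.toml (going from memory here; check the option's help for the exact accepted platform strings):
```toml
[download-terraform]
# assumed values -- Terraform-style platform names; verify against the option docs
platforms = ["linux_amd64", "darwin_arm64"]
```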
Your mention that you have ~30 modules made me realise I hadn't tested with more modules than I have cores. Doing that, I get a different error if the cache hasn't been created yet (`~/.cache/pants/named_caches/terraform_plugins/`). Something like:
`Error: The specified plugin cache dir /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror cannot be opened: stat /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror: no such file or directory`
Let me know if that's the error you're hitting or if it's still the original one.
f
Aah, okay cool! I'll give it a go again. I'm pretty sure I haven't seen that error; it's been chugging fairly happily through all the modules, only the provider-cache-checksum situation has been popping up recently
c
Aha! I have a repro using 2x as many modules as cores. I'll continue testing
👍 1
f
So the things that trigger `init` are the same; they just race on the Terraform cache, I think, and get invalid SHAs
I'm not fully sure, but that seems to fit, or at least there's something racey somewhere along the line
I also miscalculated slightly: it's 76 modules, and I'm running on 24 cores
Generating lockfiles and then running check works fine. Check without lockfiles hits the same issue
Roughly 10 of the ~70 fail on invalid checksums without lockfiles, and it seems fairly arbitrary which ones
c
Yeah, it's definitely a race condition. Extracting the file is not atomic, so when the H1 hash is taken over all the files in the directory, it's possible for it to be wrong. The current fix tries to only use the cache when it will be read-only, but I'm becoming less sure there's a provably correct way to share the cache between all modules; it seems they're still able to pull in an incorrect lockfile. There's probably a way to write a lot of code to pre-seed the cache so it's always read-only, but that seems like a lot. A quick fix might be to just split the cache by module. That would make it similar to TF's .terraform directory and would make performance equivalent to it, which was the main goal of caching. (Without caching it has to re-init for every invocation of `check`.) I'm not sure how we'd clean up the cache dir; I'd have to look into that.
f
Yeah, either separate caches, or making them run sequentially when they touch the cache, seems more and more necessary
Ah, do we currently assume the .terraform cache is per-module? It's commonly configured to be shared across modules so they can share plugins; is that a potential issue at the moment? 👀
c
Maybe I've misunderstood, but I thought the .terraform directory had to be per-module (well, per root module). Up to now we've been sharing the Terraform plugin/provider cache; when using that cache, TF will symlink providers from there instead of downloading them into the .terraform directory. Unless I've missed a setting? That would solve this.
We're just setting the `TF_PLUGIN_CACHE_DIR` envvar to use the TF provider cache. If you're passing that through with `[download-terraform].extra_env_vars` they'll conflict
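i.e. something like this in pants.toml (hypothetical value) would fight over the cache dir Pants sets on the process:
```toml
[download-terraform]
# hypothetical example -- passing your own TF_PLUGIN_CACHE_DIR through will
# conflict with the cache dir Pants configures for the sandboxed process
extra_env_vars = ["TF_PLUGIN_CACHE_DIR=/home/me/.terraform.d/plugin-cache"]
```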
f
Ah yeah I have that set as well via ~/.terraformrc to have a shared cache just in general
which I think Terraform still looks up
it definitely writes there at least; I wiped it at some point while testing
AFAIK there's no need for a per-module .terraform dir; it's pretty much only if you want a per-module provider cache locally, otherwise it can live higher up
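For reference, the shared cache I mean is just the standard CLI config, roughly like this (the path is just my local setup):
```hcl
# ~/.terraformrc
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
```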
Is the only way to force the inits not to run concurrently to add some lock? Or can it be done with something convenient on the rule graph 😄? I was thinking to try that too, to get some better intel on the concurrency angle
c
Ah! So when Pants is "not using the cache", it's just leaving the `TF_PLUGIN_CACHE_DIR` envvar unset, which means the value in your tfrc will take over. So, effectively, every request is still using a TF cache somewhere. If you try it without that setting in your tfrc, does it still fail?
f
Hmm good point, give me a sec will check
c
Also, not needing a .terraform dir per root module (everywhere you'd run `terraform apply`) is new to me. I always thought you needed one per root module because it stores the last backend config and stuff like that. Submodules only need them if you're treating them like root modules (which is what happens with `terraform validate`). Can you link me a doc on how to do that?
f
Yeah, it still fails. Ah, I specifically meant the provider cache when talking about the .terraform dir per root module
For state etc. with apply it's a different story, but cache-wise it might matter whether it's there or somewhere more global
c
Ah, that's unfortunate that it didn't work. I've pushed a commit that uses one cache per module. If we stick with that, I think we could also re-enable caching for the other operations we removed it from in this MR.
f
I've fixed this in my env by just providing lockfiles for everything. That seems to work well; I haven't found any way to make it work without them
c
Hmm, I wonder if that's because it's not overriding the setting in .terraformrc for some operations, so we still have the shared-cache problem.
f
I tested both with and without .terraformrc sharing the cache; it seemed to hit some race either way
But what seems most interesting is the lockfiles that Terraform generates when they don't already exist: generating those concurrently seems to be broken. That's something you would rarely do yourself, but Pants will cause it
I would bet on that specific process being the issue
c
Yeah, I'd bet it's a race on the cache as well. I've tried generating lockfiles locally and it seemed to work, but I'll give it another go. I'm hoping that forcing an individual cache in all cases will work.
Can you try it again? Now it creates a cache per-module and always uses it. Hopefully that eliminates the errors.
f
I tried it again and it's still racey. But it didn't create a `.terraform.d` for each module; it should in that case, right? Or did you make it put them somewhere else but still per-module?
Ah wait, does it handle this?
> This directory must already exist before Terraform will cache plugins; Terraform will not create the directory itself.
Maybe it falls back if that fails
Nvm, saw that it does, hmm
c
It should now be trying to use a separate directory within the single named cache. It uses the subpath to the BUILD file that the `terraform_module` is in as the subpath within the named cache (something like ~/.cache/pants/named_caches/terraform/path/to/module), and it should be using that cache for all Terraform operations. I'm really confused as to what's still happening. It still has the race condition during lockfile generation, right? Any chance you have multiple `terraform_module`s in the same BUILD file?
f
Got it. The race tends to be on `terraform validate` (`pants check ...`) unless lockfiles have been generated, not so much on generating lockfiles themselves. But both should be doing `init`, right?
I'm thinking it could depend on the state of `required_providers`
So the scenario was: without lockfiles, `pants check ::` fails on a bunch of Terraform errors like `there is no package`, `does not match any of the checksums recorded in the dependency lock file`, etc. For the exact same source I can `pants generate-lockfiles` and then `pants check ::` successfully. Let me double-check your branch again, actually
Okay, I must have messed up my testing last time. I think your last commits with the cache-per-module are actually fixing it
Yes @careful-address-89803 🙏🙏 I think that did it; I tested with a few cache levels (remote/local/none)
c
Awesome! That's a relief, I think the next step would have been to learn Golang and try making the cache concurrency-safe upstream 😅
f
Haha 😄 Awesome digging, thank you for figuring it out 🙂
c
You're welcome! Thank you for your patience and persistence trying out the changes!
🙏 1
i
sorry for jumping on this old thread, but i'm seeing intermittent check errors with terraform after upgrading from 2.21 to 2.24:
```
14:50:21.95 [ERROR] Completed: pants.backend.terraform.goals.check.terraform_check - terraform-validate failed (exit code 1).
Partition #1 - `terraform validate` on `api_search/infra:infra`:
Success! The configuration is valid.


Partition #2 - `terraform validate` on `common/infra/terraform:infra`:
╷
│ Error: registry.terraform.io/hashicorp/google: there is no package for registry.terraform.io/hashicorp/google 5.33.0 cached in .terraform/providers
│ 
│ 
╵
```
could this be related? from this thread, i'm not sure what the solution might be.
c
might be related? the bugfix was backported to 2.24.1 and 2.24.2, can you try one of those?
f
Yes, the backport is in 2.24.1+ I think, so 2.24.0 might still have the issue, since that minor was cut before the fix
(and 2.23.2+)
i
2.24.2 is resulting in the same error, although it appears to be happening all the time instead of intermittently (could be wrong on that).
f
Do you have terraform lockfiles?
Ah wait
Not sure if it's what you're hitting, but there's been a registry-wide issue with HashiCorp providers in the last couple of days or so, as in they can't be fetched for whatever reason. That could be related
c
Sounds unrelated, then. It sounds like you have a reliable reproduction for it; can you make a separate thread or GH issue? My first thought is that the failure looks like a shared module, and it's not always correct to run `validate` against those directly (eg they need a "root module" to specify provider versions). Beyond that, we're really just calling `terraform init` and `terraform validate` and setting the provider cache, so I'd expect it's something we'd need to dig into
f
@careful-address-89803 I'm unfortunately still seeing these races. They're much rarer with proper lockfiles, but they still happen. I wonder if init must be sequenced, or whether the Terraform cache isn't isolated, or something like that. It's always fixable by retrying, which is quite telling that there's still a race going on. It feels like it's just the fact that init is not concurrency-safe unless the cache is fully isolated per module. But I'm not sure
c
I've been thinking about this. We do separate the Terraform cache per module. One thing I thought of is that a `terraform_module` might be checked both by itself and by any `terraform_deployment`s that use it as a root module, and they would share the same cache because it's the same root module. In that case it should be safe to disable the check (on the module, I think?)
f
@careful-address-89803 hmm, we have zero `terraform_deployment`s across all the modules. Could it be that any concurrent request for the module causes races down the line?
(Say a random plugin requesting it)
I'm thinking anything that requests the same module concurrently could be affected. So while we have our own plugin doing it, it would probably be affected the same way as `terraform_deployment`. So maybe this has to be locked down to one operation per module at a time, regardless of who the caller is. I know @happy-kitchen-89482 solved something similar at a lower level, but maybe they have some ideas on how it could be done at this level too
Are the rules not set up correctly? 🤷‍♂️ They seem sensible; otherwise I'd honestly just add a lock on these at the target level
Correct me if I'm wrong, but there are no concurrency guarantees with `TerraformProcess`, so regardless of how it's invoked it's not safe, because the tool itself is not safe to invoke concurrently. And there are a bunch of ways it could end up invoked concurrently even with the isolated caches. I honestly think that's the primary remaining issue.
@happy-kitchen-89482 can you flag a tool as non-concurrent (or "wishes", a concurrency group) or similar, or do you have to enforce that separately?
c
Oh, yeah, I guess if several things requested `terraform init` for the same Terraform module then it wouldn't be concurrency-safe. I think we could get the invocation cached, though I'm not sure why they aren't being cached currently. Maybe it's not so much that they need to be cached as essentially deduplicated? Also, concurrent invocations of `terraform init` should be something we could see with logs (although maybe not with the ones we have now)
I might be able to look into this more on Wednesday. You mentioned that you have a custom plugin, can you share anything more about it?
f
Sure! We have a plugin for Flux CD that depends on the Terraform modules in order to release them. So I'm thinking multiple things requesting the module to init concurrently could lead to this. It's quite random, maybe 5-10% of runs. It essentially never fails on just a plain `pants check` on the modules, which is why I'm thinking it can get into a racey state if you're unlucky with the concurrency of requests towards the modules
The plugin might be causing more concurrent requests, which triggers it
I can probably get you some logs if that would help
Ah. We (via the plugin) request `TerraformDependenciesField`, which in turn probably does an init:
```python
@dataclass(frozen=True)
class TerraformInitRequest:
    root_module: TerraformRootModuleField
    dependencies: TerraformDependenciesField
```
It doesn't fix the init being flaky, but it's probably why it happens
h
AFAICR there is no way to cause a tool to run exclusively, but that should be pretty easy to add
f
> AFAICR there is no way to cause a tool to run exclusively, but that should be pretty easy to add
Maybe that would be a nice-to-have either way. Terraform is maybe a bit of a special case (but also not really?), but there are probably other tools where I can imagine it being useful too
@careful-address-89803 I can repro the issue when running init on separate modules concurrently when they want the same providers, even without involving that plugin. So I wonder if the cache separation is really working as we think, or if there's some race at an even lower level? @happy-kitchen-89482 can I somehow force the tasks to run sequentially with the options available today, maybe at the Pants global level? Just to see if I can confirm the concurrency being the issue
h
You can turn concurrency down to 1 with https://www.pantsbuild.org/2.24/reference/global-options#process_execution_local_parallelism
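For example (flag form of that option; the same setting can also go under `[GLOBAL]` in pants.toml):
```sh
pants --process-execution-local-parallelism=1 check --only=terraform-validate ::
```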
f
Yeah, I was experimenting with that, then discovered `--rule-threads-core` / `--rule-threads-max`. Could they still cause concurrent execution even with the one you mentioned set to 1?
h
That is concurrency for `@rule` execution in the Pants process. To make tool subprocesses run exclusively you want the option I linked to
f
I'm having a really hard time reproducing it locally; I've been trying all kinds of combinations and exploring whether remote caching could be involved
c
Pants uses the path to the module within the repo as a subfolder within Pants's cache dir to create the TF cache. Is it possible that's contributing to this?
f
Possibly, could it maybe conflict with sub-modules that are within a root module directory and have their own lockfiles? So something like:
```
- rootmodule/
  - BUILD
  - main.tf
  - versions.tf
  - .terraform.lock.hcl
  - submodule1/
    - BUILD
    - main.tf
    - versions.tf
    - .terraform.lock.hcl
  - submodule2/
    - BUILD
    - main.tf
    - versions.tf
    - .terraform.lock.hcl
```
So, for example, in the real case it looks like I have remote cache hits for the `terraform init` of rootmodule, submodule1 and submodule2, and remote cache hits for `terraform validate` on, say, rootmodule and submodule1, but then a failure on the cache-missed `terraform validate` of submodule2. So, another idea: does a cached `terraform init` maybe not populate the Terraform cache, so that a subsequent uncached `terraform validate` fails?
@careful-address-89803 As in the `terraform init` step is cached, but because of that it does not fetch providers, and then when `terraform validate` runs on the same module it doesn't have the providers
Maybe `terraform init` should just be uncacheable. I can roughly puzzle that together as the cause of what I'm seeing, but it's a bit difficult to be completely sure
c
Oh, uh, I didn't think of submodules nested in a module; I've only seen them in other folders. That will cause their cache dirs to be nested in each other. I don't know if that would cause issues, since the file paths wouldn't overlap (so ".cache/.../rootmodule/registry.terraform.io/{{ plugins }}" vs ".cache/.../rootmodule/submodule1/registry.terraform.io/{{ plugins }}"), but it's also not helping. Let me write a version that makes the cache dirs non-overlapping (I'll probably just hash the dir path). I can also make the call to init uncacheable. (I think init often isn't cached because files are modified and those are used as part of the cache key.)
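Roughly the shape I have in mind for the non-overlapping dirs (an illustrative sketch, not the actual change):
```python
import hashlib

def cache_subdir(module_path: str) -> str:
    # Derive a flat, non-overlapping named-cache subdir from the module's path,
    # so nested modules like "rootmodule" and "rootmodule/submodule1" can't end
    # up with one cache dir inside the other.
    return hashlib.sha256(module_path.encode()).hexdigest()[:16]

# e.g. cache_subdir("rootmodule") and cache_subdir("rootmodule/submodule1")
# yield two unrelated directory names.
```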
I think, though, that remote caching might be the problem here. I'm assuming you have several instances of the remote caching server? Using the local provider cache, the ".terraform" directory will contain symlinks to the provider cache inside Pants's named cache (e.g. "tf/tf4/mod/.terraform/providers/registry.terraform.io/hashicorp/azuread/2.15.0/linux_amd64" -> "/.../.cache/pants/named_caches/terraform_plugins/{{ hash }}/registry.terraform.io/hashicorp/azuread/2.15.0/linux_amd64/"), and that's what gets pulled into the digest. But if there are multiple remote cache servers, it's possible that one server runs `init`, which populates the module cache, but the one running `validate` hasn't run `init`, so it doesn't have the module cache populated.
If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache. I think you could check whether that's the case by comparing the caches across nodes; the paths should be stable.
(also just for documentation, if there isn't a lockfile, TF needs to redownload the providers even if they are present in the module cache)
You could see if https://github.com/pantsbuild/pants/pull/22183 fixes it, but I do think that it's related to remote caching
f
> But if there are multiple remote cache servers, it's possible that one server runs `init`, which populates the module cache, but the one running `validate` hasn't run `init`, so it doesn't have the module cache populated.
Hmm, let me just clarify for myself. I have a CI runner executing the work and using a remote cache, but not remote execution; the actual execution is done on GitHub Actions runners. Do you mean remote execution, or is remote caching alone also affected? But here's what I'm wondering, regardless of remote or local cache: does a cached init step populate the Terraform provider cache, or does init always have to actually run? That would explain why it works locally: if it's cached locally then it has also been run there. But on CI with a remote cache, if the step is in the cache it may never have been run on that machine. So if the Pants cache doesn't populate the Terraform cache, then caching init at all wouldn't work?
@careful-address-89803 it's a bit of brain gymnastics doing this debugging in writing, but hopefully what I'm thinking came across somewhat 😄
> If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache.
I think this is really getting to it; or actually, any Terraform operation that requires the providers must depend on init having been run in the same work unit
So from there I arrived at the (possibly easier) option of just never caching the init within Pants? Since the Terraform provider cache will be used to cache the provider fetches anyway, maybe init could just be made an uncacheable rule? Or would that prevent the provider cache from being kept at all?
c
I think you are correct. The problem we're seeing is that the digest after `init` contains symlinks (because that's how Terraform uses its cache), but the symlinks point into the Pants named cache, which is not replicated across machines. So if system0 runs the init, then system0 has the providers in its (local) named cache; it then pushes the digest, which contains symlinks into the named cache, to the remote cache. System1 wants to run, pulls the cached init digest down, but that digest has symlinks to providers that aren't in system1's named cache. I think you are correct that we can fix this by always running `init`, either by making it uncacheable or by combining it into the same execution as whatever command runs after it. If the providers are already downloaded in Terraform's module cache, it should just use them.
f
🙏 That sounds like a good direction then. Not sure about the pros/cons of which option to pick, but it seems very likely either of them will solve the issue
@careful-address-89803 are you already planning to look into a fix or should I give it a go?
c
I can work on it today. I think it would be simple enough to modify `TerraformProcess` to take a list of commands to run and then run them all in the launcher script
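Roughly this shape, just to illustrate the idea (a sketch, not the actual implementation):
```sh
#!/bin/sh
# Run every requested Terraform command in one sandboxed process, so `init` and
# whatever follows (validate, plan, ...) always share the same .terraform dir.
set -e
terraform init -input=false
terraform validate
```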
❤️ 1