fierce-truck-19259
12/07/2024, 3:13 AM
`terraform init` may not cache stably since 2.23. It can pull a terraform lockfile from cache whose checksums differ from what you would get if you ran `terraform init`, meaning `terraform validate` can also fail if that step is pulled from cache, but succeed without the cache. I'm wondering if it doesn't consider the lockfile step in how it's being cached.
fierce-truck-19259
12/07/2024, 3:31 AM
`terraform validate` succeeds
• with cache: the lockfile has the wrong hash (for providers in the lockfile) in the sandbox, and thus it fails
• touch the terraform source in any way (add a new line): it does not pull the bad lockfile from cache, and succeeds
I think it's the cache hits on terraform modules that are not covering enough scope.
happy-kitchen-89482
12/07/2024, 5:21 AMfierce-truck-19259
12/07/2024, 5:55 AMfierce-truck-19259
12/07/2024, 5:56 AMhappy-kitchen-89482
12/08/2024, 11:20 PMcareful-address-89803
12/09/2024, 12:06 AM
`terraform init` so it returns the digest with the old providers. Lemme dig into that.
careful-address-89803
12/09/2024, 2:42 AM
`terraform init` invocation has the lockfile in the request (and therefore the cache key). But most requests are for `terraform_init`, which only has the root module and dependencies in the cache key.
I wonder if marking `terraform_init` as noncacheable is the only real solution, since the lockfile is only looked for and fetched in that rule.
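The hypothesis above can be illustrated in isolation: when a memoised step is keyed on only some of its inputs, changing an input that is left out of the key still returns the old result. This is a minimal sketch of that general effect, not Pants' actual implementation; `InitKey`, `LOCKFILES`, and `run_init` are hypothetical names.

```python
from dataclasses import dataclass
from functools import lru_cache

# Hypothetical stand-in for lockfile contents on disk; NOT part of the cache key.
LOCKFILES = {"infra": "hashes-for-old-provider"}


@dataclass(frozen=True)
class InitKey:
    """Hypothetical cache key: only the root module and its dependencies."""

    root_module: str
    dependencies: tuple[str, ...]


@lru_cache(maxsize=None)
def run_init(key: InitKey) -> str:
    # Reads the lockfile as a side input that the key does not capture.
    return f"initialised {key.root_module} with {LOCKFILES[key.root_module]}"


key = InitKey("infra", ("modules/network",))
first = run_init(key)

# The lockfile changes, but the key is identical, so the stale result is reused.
LOCKFILES["infra"] = "hashes-for-new-provider"
second = run_init(key)
```

Adding the lockfile contents (or a digest of them) to the key would make the second call a cache miss; making the step uncacheable sidesteps the question entirely.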
Let me try to make a reproducer and see.
careful-address-89803
12/09/2024, 4:46 AMfierce-truck-19259
12/09/2024, 9:11 AMfierce-truck-19259
12/09/2024, 9:23 AM
`pants check --only terraform-validate` does not cause a `terraform init` and pulls that from cache unless sources are changed. And then it runs `terraform validate` without having the right providers. And if changing the source (like a line break) it triggers a run of `terraform init` before `terraform validate`, and then it is good.
fierce-truck-19259
12/09/2024, 9:24 AMfierce-truck-19259
12/09/2024, 9:39 AMfierce-truck-19259
12/11/2024, 12:28 PMcareful-address-89803
12/16/2024, 3:08 AM
`mv`ing a new version in but Pants seems to reinitialise.
Also if the problem is bad entries in the cache, you can try using a different cache dir temporarily and see if you can get a repro. Or you can wipe the cache, and if this is caused by bad entries or a switch to a new Pants version they'll be purged.
fierce-truck-19259
12/16/2024, 6:20 PMfierce-truck-19259
12/16/2024, 6:21 PMfierce-truck-19259
12/16/2024, 6:23 PMfierce-truck-19259
12/16/2024, 6:24 PMfierce-truck-19259
12/16/2024, 6:26 PMfierce-truck-19259
12/16/2024, 6:27 PMfierce-truck-19259
12/16/2024, 6:29 PMfierce-truck-19259
12/16/2024, 6:30 PMcareful-address-89803
12/18/2024, 11:54 PM
when you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git?
Do you have them ignored in gitignore or pants_ignore?
Do you manually add the lockfile as a `file` or `resource` dependency?
One of the changes from 2.22 to 2.23 was that lockfiles went from a file pulled in with a `PathGlobs` to a synthetic target. As a target it propagates its changed status and is pulled in as a dependency. I'm not sure if that would have an impact, though. If anything, I would expect that to fix it.
careful-address-89803
12/19/2024, 12:02 AM
`terraform_modules` with no lockfile is to defer to `terraform init`, which allows running without a lockfile. This will generate a lockfile which will be used later for `terraform validate`, but will not be written to the workspace.
careful-address-89803
12/19/2024, 12:11 AM
`version = "2.9"`, the error says that Terraform has the provider files for version 2.7? Or something else?
careful-address-89803
12/19/2024, 12:30 AM
pants.toml
[GLOBAL]
#pants_version = "2.22.0"
pants_version = "2.23.0"
backend_packages.add = [
  "pants.backend.experimental.terraform",
  "pants.backend.python",
]
[python]
interpreter_constraints = ["==3.9.*"]
BUILD
terraform_module(name="r")
main.tf
terraform {
  backend "local" {
    path = "/tmp/will/not/exist"
  }
  required_providers {
    null = {
      source  = "hashicorp/null"
      version = "~>3.2.0"
    }
  }
}
I have versions of the `.terraform.lock.hcl` with the hashes for either 3.2.0 or 3.2.2. Here are my steps:
1. clear the Pants cache
2. cp .terraform.lock.old .terraform.lock.hcl
3. set the pants version to 2.22
4. pants check --only=terraform-validate ::
5. kill pantsd
6. cp .terraform.lock.new .terraform.lock.hcl
7. set the pants version to 2.23
8. pants check --only=terraform-validate ::
It reruns `terraform init` every time.
If I increase the provider version (eg to `~>3.2.3`) and then try rolling back to the old lockfile, it tries to reinitialise and fails (because the lockfile doesn't have any providers that match).
Does this demo look roughly like your situation? Can you spot a key difference, besides size?
careful-address-89803
12/23/2024, 8:19 PM
`Error: registry.terraform.io/hashicorp/azurerm: the cached package for registry.terraform.io/hashicorp/azurerm 4.14.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`?
I thought the error sounded like a provider version mismatch, not whatever that error is. If that's the error you're getting, I have a case of it too. This is happening with one of my modules which doesn't have a lockfile checked in to git (sounds like your case). Poking around the sandbox, the provider itself has the correct sha256sum, but the lockfile has an incorrect hash (and only the H1 is different)! Regenerating the lockfile in the sandbox gives the correct hash (tf adds it instead of replacing).
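For comparing the lockfile found in a sandbox against a regenerated one, it can help to list the recorded hashes per provider, grouped by scheme (`h1:` versus `zh:`). A minimal sketch, assuming the usual `.terraform.lock.hcl` layout; it is regex-based and not a full HCL parser, and the function names and placeholder hash values are hypothetical.

```python
import re
from collections import defaultdict

# Matches each top-level `provider "<address>" { ... }` block.
PROVIDER_RE = re.compile(r'provider\s+"([^"]+)"\s*\{(.*?)\n\}', re.DOTALL)
# Matches quoted hash entries such as "h1:..." or "zh:...".
HASH_RE = re.compile(r'"((?:h1|zh):[^"]+)"')


def lockfile_hashes(text: str) -> dict[str, dict[str, set[str]]]:
    """Map provider address -> hash scheme ("h1"/"zh") -> recorded hashes."""
    result: dict[str, dict[str, set[str]]] = {}
    for address, body in PROVIDER_RE.findall(text):
        by_scheme: dict[str, set[str]] = defaultdict(set)
        for value in HASH_RE.findall(body):
            scheme = value.partition(":")[0]
            by_scheme[scheme].add(value)
        result[address] = dict(by_scheme)
    return result


def differing_schemes(a: str, b: str) -> dict[str, set[str]]:
    """For providers present in both lockfiles, the schemes whose hashes differ."""
    ha, hb = lockfile_hashes(a), lockfile_hashes(b)
    out: dict[str, set[str]] = {}
    for address in ha.keys() & hb.keys():
        schemes = {
            s
            for s in ha[address].keys() | hb[address].keys()
            if ha[address].get(s) != hb[address].get(s)
        }
        if schemes:
            out[address] = schemes
    return out


# Placeholder lockfile contents; the hash values are made up for illustration.
SANDBOX = '''
provider "registry.terraform.io/hashicorp/null" {
  version = "3.2.2"
  hashes = [
    "h1:AAAA-placeholder",
    "zh:1111-placeholder",
  ]
}
'''
REGENERATED = SANDBOX.replace("h1:AAAA-placeholder", "h1:BBBB-placeholder")

diff = differing_schemes(SANDBOX, REGENERATED)
```

With these placeholder inputs only the `h1` entry differs, which mirrors the observation above.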
careful-address-89803
12/23/2024, 8:20 PM
Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.
Which is exactly what Pants does with `check ::`
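One generic way to keep several processes from writing to a shared directory at the same time is an advisory file lock held around each call. This is not what Pants does; it is only a sketch of the idea for POSIX systems using `fcntl.flock`, and `exclusive_lock`, `can_lock_now`, and the lock-file path are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # waits until other holders release
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def can_lock_now(lock_path: str) -> bool:
    """Return True if the lock could be taken without waiting."""
    with open(lock_path, "w") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        fcntl.flock(probe, fcntl.LOCK_UN)
        return True


lock_path = os.path.join(tempfile.mkdtemp(), "init.lock")

with exclusive_lock(lock_path):
    # A caller would run its single `terraform init` here while holding the lock.
    blocked_while_held = not can_lock_now(lock_path)

available_after = can_lock_now(lock_path)
```

The lock only helps if every writer to the shared cache goes through it, which is why a tool-level or scheduler-level guarantee is the more robust fix.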
careful-address-89803
01/02/2025, 6:01 AMfierce-truck-19259
01/05/2025, 8:31 PM
> when you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git
Yeah exactly, ignored in .gitignore. The error is indeed
`Error: registry.terraform.io/grafana/grafana: the cached package for registry.terraform.io/grafana/grafana 2.9.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`
It sounds like you're onto something to me, thanks for the PR, I can try the branch out if you want
fierce-truck-19259
01/08/2025, 6:01 PM
`terraform init`, whose behavior is undefined. (The project I'm testing on is ~30 modules.) It's fine with manually initing each, but via pants it's just not stable. It does seem fine if I make sure, via pants, that it only causes an init for one module at a time. So it seems to me like it's the concurrency aspect of terraform init.
I think I managed to get it stable for my environment across all the modules by just committing the lockfiles (as one should anyway). However, for that I can't find a way to include hashes for other platforms, like generating them on macOS but including hashes for amd64 to be viable in CI/deployments (like you would with `terraform providers lock -platform=linux_amd64`), but that's of course a separate topic.
So in summary, I'm not sure where it really goes wrong honestly, but with lockfiles it's acting more stable, and without them it's inconsistently breaking hashes quite "randomly", such as some module getting the wrong hash for a provider that was successfully gotten from cache for the same version by some other module during the same pants command that caused init of both of them.
careful-address-89803
01/08/2025, 10:54 PM
`~/.cache/pants/named_caches/terraform_plugins/`). Something like:
`Error: The specified plugin cache dir /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror cannot be opened: stat /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror: no such file or directory`
Let me know if that's the error you're hitting or if it's still the original one.
fierce-truck-19259
01/08/2025, 10:59 PMcareful-address-89803
01/08/2025, 11:19 PMfierce-truck-19259
01/08/2025, 11:19 PM
`init` are the same, they just race with the terraform cache I think, and get invalid shas
fierce-truck-19259
01/08/2025, 11:20 PMfierce-truck-19259
01/08/2025, 11:29 PMfierce-truck-19259
01/08/2025, 11:31 PMfierce-truck-19259
01/08/2025, 11:32 PMcareful-address-89803
01/08/2025, 11:38 PM
`check`). I'm not sure how we'd clean up the cache dir, I'd have to look into that.
fierce-truck-19259
01/08/2025, 11:43 PMfierce-truck-19259
01/08/2025, 11:53 PMcareful-address-89803
01/09/2025, 12:12 AMcareful-address-89803
01/09/2025, 12:14 AM
`TF_PLUGIN_CACHE_DIR` envvar to use the TF provider cache. If you're passing that through with `[download-terraform].extra_env_vars` they'll conflict.
fierce-truck-19259
01/09/2025, 12:15 AMfierce-truck-19259
01/09/2025, 12:16 AMfierce-truck-19259
01/09/2025, 12:18 AMfierce-truck-19259
01/09/2025, 12:25 AMfierce-truck-19259
01/09/2025, 12:36 AMcareful-address-89803
01/09/2025, 1:04 AM
`TF_PLUGIN_CACHE_DIR` envvar unset, which means that the value in your tfrc will take over. So, effectively, every request is still using a TF cache somewhere. If you try it without that setting in your tfrc, does it still fail?
fierce-truck-19259
01/09/2025, 1:10 AMcareful-address-89803
01/09/2025, 1:12 AM
`terraform apply`) is new to me. I always thought you needed one per root module because it stores the last backend config and stuff like that. Submodules only need them if you're treating them like root modules (which is what happens with `terraform validate`).
Can you link me a doc on how to do that?
fierce-truck-19259
01/09/2025, 1:24 AMfierce-truck-19259
01/09/2025, 1:29 AMcareful-address-89803
01/09/2025, 2:03 AMfierce-truck-19259
01/09/2025, 7:35 PMcareful-address-89803
01/12/2025, 5:07 PMfierce-truck-19259
01/13/2025, 7:09 PMfierce-truck-19259
01/13/2025, 7:13 PMfierce-truck-19259
01/13/2025, 7:13 PMcareful-address-89803
01/19/2025, 3:06 AMcareful-address-89803
01/19/2025, 4:48 AMfierce-truck-19259
01/22/2025, 9:42 PM
`.terraform.d` for each module, in that case it should, right? Or did you make it put them somewhere else but per module?
fierce-truck-19259
01/22/2025, 9:50 PM
> This directory must already exist before Terraform will cache plugins; Terraform will not create the directory itself.
Maybe it falls back if that fails
fierce-truck-19259
01/22/2025, 9:53 PMcareful-address-89803
01/22/2025, 10:12 PM
`terraform_module` in the same BUILD file?
fierce-truck-19259
01/22/2025, 10:18 PM
`terraform validate` (pants check ...) unless there's lockfiles generated. Not so much on generating lockfiles themselves. But both should be doing `init`, right?
fierce-truck-19259
01/22/2025, 10:20 PM
`required_providers`
fierce-truck-19259
01/22/2025, 10:26 PM
`pants check ::` fails on a bunch of errors from terraform like `there is no package`, `does not match any of the checksums recorded in the dependency lock file` etc. For the exact same source I can `pants generate-lockfiles` and then `pants check ::` successfully.
Let me double check your branch again actually
fierce-truck-19259
01/22/2025, 10:29 PMfierce-truck-19259
01/22/2025, 10:38 PMcareful-address-89803
01/22/2025, 10:41 PMfierce-truck-19259
01/22/2025, 10:42 PMcareful-address-89803
01/22/2025, 10:51 PMincalculable-france-51377
02/19/2025, 3:22 PM
14:50:21.95 [ERROR] Completed: pants.backend.terraform.goals.check.terraform_check - terraform-validate failed (exit code 1).
Partition #1 - `terraform validate` on `api_search/infra:infra`:
Success! The configuration is valid.
Partition #2 - `terraform validate` on `common/infra/terraform:infra`:
╷
│ Error: registry.terraform.io/hashicorp/google: there is no package for registry.terraform.io/hashicorp/google 5.33.0 cached in .terraform/providers
│
│
╵
could this be related? from this thread, i'm not sure what the solution might be.
careful-address-89803
02/19/2025, 9:19 PMfierce-truck-19259
02/19/2025, 10:32 PMfierce-truck-19259
02/19/2025, 10:33 PMincalculable-france-51377
02/20/2025, 4:30 PMfierce-truck-19259
02/20/2025, 5:00 PMfierce-truck-19259
02/20/2025, 5:00 PMfierce-truck-19259
02/20/2025, 5:02 PMcareful-address-89803
02/21/2025, 9:27 PM
`validate` against those directly (eg they need a "root module" to specify provider versions). Beyond that, we're really just calling `terraform init` and `terraform validate` and setting the providers cache. So I'd expect it's something we'd need to dig into.
fierce-truck-19259
03/25/2025, 7:53 PMcareful-address-89803
03/28/2025, 2:24 AM
`terraform_module` might be checked by both itself and any `terraform_deployment`s that use it as a root module, and they would share the same cache because it's the same root module. In that case it should be safe to disable the check (I think on the module?)
fierce-truck-19259
03/29/2025, 9:21 PM
`terraform_deployment` for all the modules. Could it be that any concurrent request for the module could cause races down the line?
fierce-truck-19259
03/29/2025, 9:22 PMfierce-truck-19259
03/29/2025, 9:27 PM
`terraform_deployment`. So maybe that has to be locked down to one operation per module concurrently, regardless of who the caller was. I know @happy-kitchen-89482 solved something similar at a lower level, but maybe has some ideas on how it could be done at this level too
fierce-truck-19259
03/29/2025, 9:32 PMfierce-truck-19259
03/30/2025, 1:43 AM
`TerraformProcess`, so regardless of how it's invoked it's not safe, because the tool itself is not safe to invoke concurrently. And there's a bunch of ways that could end up invoked concurrently even with the isolated caches. I think that is the primary remaining issue honestly.
fierce-truck-19259
03/30/2025, 1:54 AMcareful-address-89803
03/30/2025, 2:44 AMcareful-address-89803
03/30/2025, 2:46 AMfierce-truck-19259
03/30/2025, 2:55 AM
`pants check` on the modules. So that's why I'm thinking it might be able to get into a racy state if you're unlucky with the concurrency of requests towards the modules
fierce-truck-19259
03/30/2025, 2:57 AMfierce-truck-19259
03/30/2025, 3:02 AMfierce-truck-19259
03/30/2025, 3:28 AM
`TerraformDependenciesField` which in turn probably does an init:
@dataclass(frozen=True)
class TerraformInitRequest:
    root_module: TerraformRootModuleField
    dependencies: TerraformDependenciesField
It doesn't fix the init being flaky, but it's probably why it happens
happy-kitchen-89482
03/30/2025, 8:38 PMfierce-truck-19259
04/07/2025, 5:59 PM
> AFAICR there is no way to cause a tool to run exclusively, but that should be pretty easy to add
Maybe that would be a nice-to-have either way. Maybe terraform is a bit of a special case (but also not really?) but there's possibly other tools where I can imagine it to be useful too
fierce-truck-19259
04/07/2025, 8:11 PMhappy-kitchen-89482
04/07/2025, 11:26 PM
https://www.pantsbuild.org/2.24/reference/global-options#process_execution_local_parallelism
fierce-truck-19259
04/07/2025, 11:30 PM
`--rule-threads-core` `--rule-threads-max` could they still cause concurrent execution even with the one you mentioned set to 1?
happy-kitchen-89482
04/07/2025, 11:57 PM
`@rule` execution in the pants process. To make tool subprocesses run exclusively you want the option I linked to.
fierce-truck-19259
04/08/2025, 2:15 PMcareful-address-89803
04/08/2025, 3:23 PMfierce-truck-19259
04/08/2025, 8:31 PM
- rootmodule/
- BUILD
- main.tf
- versions.tf
- .terraform.lock.hcl
- submodule1/
- BUILD
- main.tf
- versions.tf
- .terraform.lock.hcl
- submodule2/
- BUILD
- main.tf
- versions.tf
- .terraform.lock.hcl
fierce-truck-19259
04/08/2025, 8:40 PM
`terraform init` of rootmodule, submodule1 and submodule2. And a remote cache hit for `terraform validate`, for example on rootmodule and submodule1. But then a failure on the cache-missed `terraform validate` of submodule2.
So, another idea: does the cached `terraform init` maybe not populate the terraform cache, so that a subsequent uncached `terraform validate` fails?
fierce-truck-19259
04/08/2025, 8:41 PM
`terraform init` step is cached, but because of that it does not fetch providers, and then when `terraform validate` runs on the same module it doesn't have the providers
fierce-truck-19259
04/08/2025, 8:49 PM
`terraform init` should just be uncacheable. I can roughly puzzle that together to be the cause for what I'm seeing, but it's a bit difficult to be completely sure
careful-address-89803
04/10/2025, 1:27 AMcareful-address-89803
04/10/2025, 3:01 AM
But if there are multiple remote cache servers, it's possible that one server runs `init` which populates the module cache, but the one running `validate` hasn't run `init` so it doesn't have the module cache populated.
careful-address-89803
04/10/2025, 3:04 AM
If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache. I think you could check if that's the case by comparing the caches across nodes. The paths should be stable.
careful-address-89803
04/10/2025, 3:05 AMcareful-address-89803
04/10/2025, 3:16 AMfierce-truck-19259
04/11/2025, 5:42 PM
> But if there are multiple remote cache servers, it's possible that one server runs `init` which populates the module cache, but the one running `validate` hasn't run `init` so it doesn't have the module cache populated.
Hmm let me just clarify for myself. I have a CI runner executing the work, and using a remote cache, but it's not using remote execution. Do you mean remote execution, or is only remote caching affected also? The actual execution is done on github action runners. But here's where I'm thinking even regardless of remote or local cache. Does a cached init step populate the terraform provider cache? Or does it have to always be run? Because that would explain why it works locally: if it's cached then it's also been run, but on CI with remote cache, if it's in the cache it may not have been run. So if the pants cache doesn't populate the terraform cache then it wouldn't work for it to be cached?
fierce-truck-19259
04/11/2025, 5:43 PMfierce-truck-19259
04/11/2025, 5:47 PM
> If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache.
I think this is really getting to it, or actually any terraform operation that requires the providers must depend on init having been run in the same work unit
fierce-truck-19259
04/11/2025, 10:08 PMcareful-address-89803
04/22/2025, 12:25 AM
`init` contains symlinks (because that's how terraform uses its cache). But the symlinks point into the pants named-cache, which is not replicated across machines. So if system0 runs the init, then system0 has the providers in its named cache (which is local); it then pushes the digest (which contains symlinks to the named cache) to the remote cache. System1 wants to run, it pulls the digest from the cached init down, but that digest has symlinks to providers that aren't in system1's named cache.
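A quick way to check for this situation on a given machine is to look for symlinks whose targets are missing. The following is a minimal sketch using only the standard library; `dangling_symlinks` is a hypothetical helper, and the temporary directories merely stand in for a named cache and a materialised digest.

```python
import os
import tempfile


def dangling_symlinks(root: str) -> list[str]:
    """Return paths under root that are symlinks whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):  # symlinked dirs are not followed
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink inspects the link itself; exists follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


# Stand-ins: one directory plays the local named cache, the other the sandbox.
named_cache = tempfile.mkdtemp()
sandbox = tempfile.mkdtemp()

provider = os.path.join(named_cache, "provider-binary")
open(provider, "w").close()
link = os.path.join(sandbox, "provider-link")
os.symlink(provider, link)

before = dangling_symlinks(sandbox)  # target exists, as on the machine that ran init

os.remove(provider)  # simulate a machine whose named cache lacks the provider
after = dangling_symlinks(sandbox)
```

Pointing such a check at a preserved sandbox's `.terraform/providers` directory would show whether the links resolve on the machine that pulled the cached result.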
I think you are correct that we can fix this by always running `init`, either by making it uncacheable or by combining it in the same execution as whatever command will be run after. If the providers are downloaded in terraform's module cache, it should just use them.
fierce-truck-19259
04/22/2025, 8:19 AMfierce-truck-19259
04/22/2025, 3:57 PMcareful-address-89803
04/23/2025, 3:33 PMcareful-address-89803
04/24/2025, 2:39 AM