I'm finding `terraform init` may not cache stably ...
# general
f
I'm finding `terraform init` may not cache stably on 2.23. It can pull a Terraform lockfile from cache whose checksums don't match what a fresh `terraform init` would produce, which means `terraform validate` can also fail when that step is pulled from cache but succeed without the cache. I'm wondering if the lockfile isn't being considered in how that step is cached.
I don't have a great repro, but in my workspace I can see:
• local and remote cache disabled: the lockfile with the correct hash is present in the sandbox and `terraform validate` succeeds
• with cache: the lockfile has the wrong hash (for the providers in the lockfile) in the sandbox and so it fails
• touching the Terraform source in any way (e.g. adding a new line): the bad lockfile is not pulled from cache and it succeeds
I think it's the cache hits on Terraform modules that aren't covering enough scope
h
And this is new in 2.23?
f
Yes, it doesn't repro on 2.22. I noticed it while upgrading from 2.22, which is what wrote the remote cache I then tested against, so I'm not entirely sure if it's some transitional thing or a regression
But I saw there was some work specifically on Terraform in 2.23, so maybe that could be related
h
cc @careful-address-89803 Presumably related to https://github.com/pantsbuild/pants/pull/21221 ?
c
I implemented Terraform caching in that MR using Terraform's ability to locally cache downloaded modules and a named cache. If I understand the steps to reproduce:
1. Run Pants in a way that `init`s Terraform
2. Update the lockfile but not the sources
It sounds like the lockfile isn't used when computing the cache key for the rule that runs `terraform init`, so it returns the digest with the old providers. Lemme dig into that
That looks like the most likely cause. `get_terraform_providers`, which actually dispatches the `terraform init` invocation, has the lockfile in the request (and therefore in the cache key). But most requests are for `terraform_init`, which only has the root module and dependencies in its cache key. I wonder if marking `terraform_init` as non-cacheable is the only real solution, since the lockfile is only looked up and fetched in that rule. Let me try to make a reproducer and see.
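Roughly the shape I mean (paraphrasing from memory; these names are hypothetical, not the real Pants types):
```python
# Illustrative only: in Pants, what ends up in the request is what ends up in the
# cache key, so a lockfile that isn't captured here can't invalidate a cached init.
from dataclasses import dataclass

@dataclass(frozen=True)
class InitLikeRequest:          # ~ terraform_init: lockfile not part of the key
    root_module: str
    dependencies_digest: str

@dataclass(frozen=True)
class ProvidersLikeRequest:     # ~ get_terraform_providers: lockfile is in the key
    root_module: str
    dependencies_digest: str
    lockfile_digest: str
```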
I can't get this to repro; Pants keeps rerunning `terraform init`. Can you give me more details on how you're updating the lockfile? I only have a local cache; can you still repro this if you disable the remote cache? I can also make a few fixes for you to try out, but that sounds like a lot of work for both of us
f
Yeah I can repro with remote cache disabled, I'll see if I can get some more specifics for a repro to share
Ah, I think I understand the difference with the repro: `pants check --only terraform-validate` does not cause a `terraform init` and instead pulls that step from cache unless sources have changed. It then runs `terraform validate` without having the right providers. If I change the source (e.g. add a line break), it triggers a `terraform init` before `terraform validate` and then it is fine.
So, as I think you guessed, the cache key of init seems insufficient
I don't know if cache written by 2.22 and then consumed by 2.23 could be a factor
@careful-address-89803 any ideas whether the 2.22-to-2.23 transition could matter cache-wise? It's not making complete sense to me yet: it fails unless init runs, but it also doesn't init consistently, and it's not obvious where that forks the wrong way
c
I'm honestly not sure. I've only done a little work with caching, not enough to say definitively. I would think the Pants version shouldn't affect the cache, because the rule's input is the same. It makes sense to me that not reinitialising would result in failure, but I'm not sure why it would init only sporadically. Can you give me more details on how you're modifying the lockfile? I tried just `mv`ing a new version in, but Pants seems to reinitialise. Also, if the problem is bad entries in the cache, you can try using a different cache dir temporarily and see if you can still get a repro (a sketch of that below). Or you can wipe the cache; if this is caused by bad entries or by switching to a new Pants version, they'll be purged.
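Something like this should point a single run at fresh cache dirs (flag names from the global options; paths are just examples):
```sh
# one-off run against throwaway local caches
pants --local-store-dir=/tmp/pants-lmdb --named-caches-dir=/tmp/pants-named-caches \
  check --only=terraform-validate ::
```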
f
I think the issue is that we don't have stable lockfiles. That would explain it grabbing some other version of the providers that looks okay to Pants but then breaks on checksums in Terraform. I'll see if that is why it's happening here
Especially if source files give approximate version requirements
So the cache lookups based on sources are doing their best, but there's no lockfile to confirm against. Ideally that should fail with "can't determine which providers to pull in" rather than use whatever is in cache and then fail on a checksum at the Terraform level; that would reduce confusion, I think
@careful-address-89803 above ☝️
It's interesting in the sense that maybe if I ask for version 2.x of some provider, it breaks against 2.7 if I had 2.9 locally, or some similar scenario
There must be a condition for when inits happen, and that condition is not properly hermetic
That's the only thing I can think of; it's difficult to narrow down to a specific case. From my repros it's not stable across 2.22 and 2.23, but killing the cache and going from 2.23 is fine. The question is when it will happen again,
since wiping the cache can be painful for large-scale applications of Pants
c
When you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git? Do you have them ignored in .gitignore or pants_ignore? Do you manually add the lockfile as a `file` or `resource` dependency? One of the changes from 2.22 to 2.23 was that lockfiles went from being a file pulled in with `PathGlobs` to a synthetic target. As a target, it propagates its changed status and is pulled in as a dependency. I'm not sure if that would have an impact, though; if anything, I would expect that to fix it.
Also to clarify: it's not that Pants is pulling in a cached version of the lockfile, it's that Pants is pulling in a cached result of initialising Terraform. Pants's behaviour for `terraform_module`s with no lockfile is to defer to `terraform init`, which allows running without a lockfile. That generates a lockfile which will be used later for `terraform validate`, but it will not be written to the workspace.
Also, when you say "if I ask for version 2.x of some provider it breaks against 2.7 if I had 2.9 locally", what do you mean? Do you mean that if your lockfile has `version = "2.9"`, the error says that Terraform has the provider files for version 2.7? Or something else?
I also still can't get a reproduction. Here's my setup:
pants.toml
```toml
[GLOBAL]
#pants_version = "2.22.0"
pants_version = "2.23.0"

backend_packages.add = [
      "pants.backend.experimental.terraform",
      "pants.backend.python",
]

[python]
interpreter_constraints = ["==3.9.*"]
```
BUILD
```python
terraform_module(name="r")
```
main.tf
```hcl
terraform {
  backend "local" {
    path = "/tmp/will/not/exist"
  }
  required_providers {
    null = {
      source  = "hashicorp/null"
      version = "~>3.2.0"
    }
  }
}
```
I have versions of the `.terraform.lock.hcl` with the hashes for either 3.2.0 or 3.2.2. Here are my steps:
1. clear the Pants cache
2. `cp .terraform.lock.old .terraform.lock.hcl`
3. set the pants version to 2.22
4. `pants check --only=terraform-validate ::`
5. kill pantsd
6. `cp .terraform.lock.new .terraform.lock.hcl`
7. set the pants version to 2.23
8. `pants check --only=terraform-validate ::`
It reruns `terraform init` every time. If I increase the provider version (eg to `~>3.2.3`) and then try rolling back to the old lockfile, it tries to reinitialise and fails (because the lockfile doesn't have any providers that match). Does this demo look roughly like your situation? Can you spot a key difference, besides size?
I think I didn't read this closely enough: is this the error you're getting (for your module/provider)?
`Error: registry.terraform.io/hashicorp/azurerm: the cached package for registry.terraform.io/hashicorp/azurerm 4.14.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`
I thought the error sounded like a provider version mismatch, not this. If that's the error you're getting, I have a case of it too. It's happening with one of my modules that doesn't have a lockfile checked in to git (sounds like your case). Poking around the sandbox, the provider itself has the correct sha256sum, but the lockfile has an incorrect hash (and only the H1 is different)! Regenerating the lockfile in the sandbox gives the correct hash (TF adds it instead of replacing).
I have a feeling it's because of this note on the provider plugin cache: https://developer.hashicorp.com/terraform/cli/config/config-file#provider-plugin-cache
> Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.
Which is exactly what Pants does with `check ::`.
f
Sorry for the late reply
> when you say "we don't have stable lockfiles", do you mean that you don't commit the `.terraform.lock.hcl` to git
Yeah exactly, they're ignored in .gitignore. The error is indeed:
`Error: registry.terraform.io/grafana/grafana: the cached package for registry.terraform.io/grafana/grafana 2.9.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file`
It sounds like you're onto something. Thanks for the PR, I can try the branch out if you want
@careful-address-89803 I experimented a bit with your branch for that PR on the same workspace where I originally ran into this. Unfortunately it didn't fix the error: even without lockfiles it still has trouble, which I think is what you linked about concurrent `terraform init` having undefined behaviour. (The project I'm testing on has ~30 modules.) It's fine when I manually init each module, but via Pants it's just not stable. It does seem fine if, via Pants, I make sure only one module gets inited at a time, so it looks to me like the concurrency aspect of `terraform init`. I think I managed to get it stable for my environment across all modules by just committing the lockfiles (as one should anyway). However, I can't find a way to include hashes for other platforms, e.g. generating them on macOS but including amd64 hashes so they're viable in CI/deployments (like you would with `terraform providers lock -platform=linux_amd64`), but that's of course a separate topic. So in summary, I'm honestly not sure where it really goes wrong, but with lockfiles it acts more stable; without them it inconsistently breaks hashes quite "randomly", such as one module getting the wrong hash for a provider that another module successfully got from cache for the same version during the same Pants command that caused both of them to init.
c
Thanks so much for trying it! I missed lockfile generation; it was still using the cache 🤦. I've pushed another commit that should skip the cache when generating lockfiles. Sorry about that! You should be able to generate cross-platform lockfiles with the `[download-terraform].platforms` setting.
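Something roughly like this in pants.toml (going from memory here; check the option's help for the exact accepted platform strings):
```toml
[download-terraform]
# assumed values -- Terraform-style platform names; verify against the option docs
platforms = ["linux_amd64", "darwin_arm64"]
```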
Your mention that you have ~30 modules made me realise I hadn't tested with more modules than I have cores. Doing that, I get a different error if the cache hasn't been created yet (`~/.cache/pants/named_caches/terraform_plugins/`). Something like:
`Error: The specified plugin cache dir /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror cannot be opened: stat /tmp/pants-sandbox-kvwP0x/__terraform_filesystem_mirror: no such file or directory`
Let me know if that's the error you're hitting or if it's still the original one.
f
Aah, okay cool! I'll give it a go again. I'm pretty sure I haven't seen that error; it's been chugging fairly happily through all the modules, only the provider-cache-checksum situation has been popping up recently
c
Aha! I have a repro using 2x as many modules as cores. I'll continue testing
👍 1
f
So the things that trigger `init` are the same; they just race on the Terraform cache, I think, and get invalid SHAs
I'm not fully sure, but that seems to fit, or at least there's something racey somewhere along the line
I also miscalculated slightly: it's 76 modules, and I'm running on 24 cores
Generating lockfiles and then running check works fine. Check without lockfiles hits the same issue
Roughly 10 of the ~70 fail on invalid checksums without lockfiles, and it seems fairly arbitrary which ones
c
Yeah, it's definitely a race condition. Extracting the file is not atomic, so when the H1 hash is taken over all the files in the directory, it's possible for it to be wrong. The current fix tries to only use the cache when it will be read-only, but I'm becoming less sure there's a provably correct way to share the cache between all modules; it seems they're still able to pull in an incorrect lockfile. There's probably a way to write a lot of code to pre-seed the cache so it's always read-only, but that seems like a lot. A quick fix might be to just split the cache by module. That would make it similar to TF's .terraform directory and would make performance equivalent to it, which was the main goal of caching. (Without caching it has to re-init for every invocation of `check`.) I'm not sure how we'd clean up the cache dir; I'd have to look into that.
f
Yeah, either separate caches, or making them run sequentially when they touch the cache, seems more and more necessary
Ah, do we currently assume the .terraform cache is per-module? It's commonly configured to be shared across modules so they can share plugins; is that a potential issue at the moment? 👀
c
Maybe I've misunderstood, but I thought the .terraform directory had to be per-module (well, per root module). Up to now we've been sharing the Terraform plugin/provider cache; when using that cache, TF will symlink providers from there instead of downloading them into the .terraform directory. Unless I've missed a setting? That would solve this.
We're just setting the `TF_PLUGIN_CACHE_DIR` envvar to use the TF provider cache. If you're passing that through with `[download-terraform].extra_env_vars` they'll conflict
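i.e. something like this in pants.toml (hypothetical value) would fight over the cache dir Pants sets on the process:
```toml
[download-terraform]
# hypothetical example -- passing your own TF_PLUGIN_CACHE_DIR through will
# conflict with the cache dir Pants configures for the sandboxed process
extra_env_vars = ["TF_PLUGIN_CACHE_DIR=/home/me/.terraform.d/plugin-cache"]
```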
f
Ah yeah I have that set as well via ~/.terraformrc to have a shared cache just in general
which I think Terraform still looks up
it definitely writes there at least; I wiped it at some point while testing
AFAIK there's no need for a per-module .terraform dir; it's pretty much only if you want a per-module provider cache locally, otherwise it can live higher up
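For reference, the shared cache I mean is just the standard CLI config, roughly like this (the path is just my local setup):
```hcl
# ~/.terraformrc
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
```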
Is the only way to force the inits not to run concurrently to add some lock? Or can it be done with something convenient on the rule graph 😄? I was thinking to try that too, to get some better intel on the concurrency angle
c
Ah! So when Pants is "not using the cache", it's just leaving the `TF_PLUGIN_CACHE_DIR` envvar unset, which means the value in your tfrc will take over. So, effectively, every request is still using a TF cache somewhere. If you try it without that setting in your tfrc, does it still fail?
f
Hmm good point, give me a sec will check
c
Also, not needing a .terraform dir per root module (everywhere you'd run `terraform apply`) is new to me. I always thought you needed one per root module because it stores the last backend config and stuff like that. Submodules only need them if you're treating them like root modules (which is what happens with `terraform validate`). Can you link me a doc on how to do that?
f
Yeah, it still fails. Ah, I specifically meant the provider cache when talking about the .terraform dir per root module
For state etc. with apply it's a different story, but cache-wise it might matter whether it's there or somewhere more global
c
Ah, that's unfortunate that it didn't work. I've pushed a commit that uses one cache per module. If we stick with that, I think we could also re-enable caching for the other operations we removed it from in this MR.
f
I've fixed this in my env by just providing lockfiles for everything. That seems to work well; I haven't found any way to make it work without them
c
Hmm, I wonder if that's because it's not overriding the setting in .terraformrc for some operations, so we still have the shared-cache problem.
f
I tested both with and without .terraformrc sharing the cache; it seemed to hit some race either way
But what seems most interesting is the lockfiles that Terraform generates when they don't already exist: generating those concurrently seems to be broken. That's something you would rarely do yourself, but Pants will cause it
I would bet on that specific process being the issue
c
Yeah, I'd bet it's a race on the cache as well. I've tried generating lockfiles locally and it seemed to work, but I'll give it another go. I'm hoping that forcing an individual cache in all cases will work.
Can you try it again? Now it creates a cache per-module and always uses it. Hopefully that eliminates the errors.
f
I tried it again and it's still racey. But it didn't create a `.terraform.d` for each module; it should in that case, right? Or did you make it put them somewhere else but still per-module?
Ah wait, does it handle this?
> This directory must already exist before Terraform will cache plugins; Terraform will not create the directory itself.
Maybe it falls back if that fails
Nvm, saw that it does, hmm
c
It should now be trying to use a separate directory within the single named cache. It uses the subpath to the BUILD file that the `terraform_module` is in as the subpath within the named cache (something like ~/.cache/pants/named_caches/terraform/path/to/module), and it should be using that cache for all Terraform operations. I'm really confused as to what's still happening. It still has the race condition during lockfile generation, right? Any chance you have multiple `terraform_module`s in the same BUILD file?
f
Got it. The race tends to be on `terraform validate` (`pants check ...`) unless lockfiles have been generated, not so much on generating lockfiles themselves. But both should be doing `init`, right?
I'm thinking it could depend on the state of `required_providers`
So the scenario was: without lockfiles, `pants check ::` fails on a bunch of Terraform errors like `there is no package`, `does not match any of the checksums recorded in the dependency lock file`, etc. For the exact same source I can `pants generate-lockfiles` and then `pants check ::` successfully. Let me double-check your branch again, actually
Okay, I must have messed up my testing last time. I think your last commits with the cache-per-module are actually fixing it
Yes @careful-address-89803 🙏🙏 I think that did it; I tested with a few cache levels (remote/local/none)
c
Awesome! That's a relief, I think the next step would have been to learn Golang and try making the cache concurrency-safe upstream 😅
f
Haha 😄 Awesome digging, thank you for figuring it out 🙂
c
You're welcome! Thank you for your patience and persistence trying out the changes!
🙏 1
i
sorry for jumping on this old thread, but i'm seeing intermittent check errors with terraform after upgrading from 2.21 to 2.24:
```
14:50:21.95 [ERROR] Completed: pants.backend.terraform.goals.check.terraform_check - terraform-validate failed (exit code 1).
Partition #1 - `terraform validate` on `api_search/infra:infra`:
Success! The configuration is valid.


Partition #2 - `terraform validate` on `common/infra/terraform:infra`:
╷
│ Error: registry.terraform.io/hashicorp/google: there is no package for registry.terraform.io/hashicorp/google 5.33.0 cached in .terraform/providers
│ 
│ 
╵
```
could this be related? from this thread, i'm not sure what the solution might be.
c
might be related? the bugfix was backported to 2.24.1 and 2.24.2, can you try one of those?
f
Yes, the backport is in 2.24.1+ I think, so 2.24.0 might still have the issue, since that minor was cut before the fix
(and 2.23.2+)
i
2.24.2 is resulting in the same error, although it appears to be happening all the time instead of intermittently (could be wrong on that).
f
Do you have terraform lockfiles?
Ah wait
Not sure if it's what you're hitting, but there's been a registry-wide issue with HashiCorp providers in the last couple of days or so, as in they can't be fetched for whatever reason. That could be related
c
Sounds unrelated, then. It sounds like you have a reliable reproduction for it; can you make a separate thread or GH issue? My first thought is that the failure looks like a shared module, and it's not always correct to run `validate` against those directly (eg they need a "root module" to specify provider versions). Beyond that, we're really just calling `terraform init` and `terraform validate` and setting the provider cache, so I'd expect it's something we'd need to dig into
f
@careful-address-89803 I'm unfortunately still seeing these races. They're much rarer with proper lockfiles, but they still happen. I wonder if init must be sequenced, or whether the Terraform cache isn't isolated, or something like that. It's always fixable by retrying, which is quite telling that there's still a race going on. It feels like it's just the fact that init is not concurrency-safe unless the cache is fully isolated per module. But I'm not sure
c
I've been thinking about this. We do separate the Terraform cache per module. One thing I thought of is that a `terraform_module` might be checked both by itself and by any `terraform_deployment`s that use it as a root module, and they would share the same cache because it's the same root module. In that case it should be safe to disable the check (on the module, I think?)
f
@careful-address-89803 hmm, we have zero `terraform_deployment`s across all the modules. Could it be that any concurrent request for the module causes races down the line?
(Say a random plugin requesting it)
I'm thinking anything that requests the same module concurrently could be affected. So while we have our own plugin doing it, it would probably be affected the same way as `terraform_deployment`. So maybe this has to be locked down to one operation per module at a time, regardless of who the caller is. I know @happy-kitchen-89482 solved something similar at a lower level, but maybe they have some ideas on how it could be done at this level too
Are the rules not set up correctly? 🤷‍♂️ They seem sensible; otherwise I'd honestly just add a lock on these at the target level
Correct me if I'm wrong, but there are no concurrency guarantees with `TerraformProcess`, so regardless of how it's invoked it's not safe, because the tool itself is not safe to invoke concurrently. And there are a bunch of ways it could end up invoked concurrently even with the isolated caches. I honestly think that's the primary remaining issue.
@happy-kitchen-89482 can you flag a tool as non-concurrent (or "wishes", a concurrency group) or similar, or do you have to enforce that separately?
c
Oh, yeah, I guess if several things requested `terraform init` for the same Terraform module then it wouldn't be concurrency-safe. I think we could get the invocation cached, though I'm not sure why they aren't being cached currently. Maybe it's not so much that they need to be cached as essentially deduplicated? Also, concurrent invocations of `terraform init` should be something we could see with logs (although maybe not with the ones we have now)
I might be able to look into this more on Wednesday. You mentioned that you have a custom plugin, can you share anything more about it?
f
Sure! We have a plugin for Flux CD that depends on the Terraform modules in order to release them. So I'm thinking multiple things requesting the module to init concurrently could lead to this. It's quite random, maybe 5-10% of runs. It essentially never fails on just a plain `pants check` on the modules, which is why I'm thinking it can get into a racey state if you're unlucky with the concurrency of requests towards the modules
The plugin might be causing more concurrent requests, which triggers it
I can probably get you some logs if that would help
Ah. We (via the plugin) request `TerraformDependenciesField`, which in turn probably does an init:
```python
@dataclass(frozen=True)
class TerraformInitRequest:
    root_module: TerraformRootModuleField
    dependencies: TerraformDependenciesField
```
It doesn't fix the init being flaky, but it's probably why it happens
h
AFAICR there is no way to cause a tool to run exclusively, but that should be pretty easy to add
f
> AFAICR there is no way to cause a tool to run exclusively, but that should be pretty easy to add
Maybe that would be a nice-to-have either way. Terraform is maybe a bit of a special case (but also not really?), but there are probably other tools where I can imagine it being useful too
@careful-address-89803 I can repro the issue when running init on separate modules concurrently when they want the same providers, even without involving that plugin. So I wonder if the cache separation is really working as we think, or if there's some race at an even lower level? @happy-kitchen-89482 can I somehow force the tasks to run sequentially with the options available today, maybe at the Pants global level? Just to see if I can confirm the concurrency being the issue
h
You can turn concurrency down to 1 with https://www.pantsbuild.org/2.24/reference/global-options#process_execution_local_parallelism
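For example (flag form of that option; the same setting can also go under `[GLOBAL]` in pants.toml):
```sh
pants --process-execution-local-parallelism=1 check --only=terraform-validate ::
```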
f
Yeah, I was experimenting with that, then discovered `--rule-threads-core` / `--rule-threads-max`. Could they still cause concurrent execution even with the one you mentioned set to 1?
h
That is concurrency for `@rule` execution in the Pants process. To make tool subprocesses run exclusively you want the option I linked to
f
I'm having a really hard time reproducing it locally; I've been trying all kinds of combinations and exploring whether remote caching could be involved
c
Pants uses the path to the module within the repo as a subfolder within Pants's cache dir to create the TF cache. Is it possible that's contributing to this?
f
Possibly, could it maybe conflict with sub-modules that are within a root module directory and have their own lockfiles? So something like:
```
- rootmodule/
  - BUILD
  - main.tf
  - versions.tf
  - .terraform.lock.hcl
  - submodule1/
    - BUILD
    - main.tf
    - versions.tf
    - .terraform.lock.hcl
  - submodule2/
    - BUILD
    - main.tf
    - versions.tf
    - .terraform.lock.hcl
```
So, for example, in the real case it looks like I have remote cache hits for the `terraform init` of rootmodule, submodule1 and submodule2, and remote cache hits for `terraform validate` on, say, rootmodule and submodule1, but then a failure on the cache-missed `terraform validate` of submodule2. So, another idea: does a cached `terraform init` maybe not populate the Terraform cache, so that a subsequent uncached `terraform validate` fails?
@careful-address-89803 As in the `terraform init` step is cached, but because of that it does not fetch providers, and then when `terraform validate` runs on the same module it doesn't have the providers
Maybe `terraform init` should just be uncacheable. I can roughly puzzle that together as the cause of what I'm seeing, but it's a bit difficult to be completely sure
c
Oh, uh, I didn't think of submodules nested in a module; I've only seen them in other folders. That will cause their cache dirs to be nested in each other. I don't know if that would cause issues, since the file paths wouldn't overlap (so ".cache/.../rootmodule/registry.terraform.io/{{ plugins }}" vs ".cache/.../rootmodule/submodule1/registry.terraform.io/{{ plugins }}"), but it's also not helping. Let me write a version that makes the cache dirs non-overlapping (I'll probably just hash the dir path). I can also make the call to init uncacheable. (I think init often isn't cached because files are modified and those are used as part of the cache key.)
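Roughly the shape I have in mind for the non-overlapping dirs (an illustrative sketch, not the actual change):
```python
import hashlib

def cache_subdir(module_path: str) -> str:
    # Derive a flat, non-overlapping named-cache subdir from the module's path,
    # so nested modules like "rootmodule" and "rootmodule/submodule1" can't end
    # up with one cache dir inside the other.
    return hashlib.sha256(module_path.encode()).hexdigest()[:16]

# e.g. cache_subdir("rootmodule") and cache_subdir("rootmodule/submodule1")
# yield two unrelated directory names.
```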
I think, though, that remote caching might be the problem here. I'm assuming you have several instances of the remote caching server? Using the local provider cache, the ".terraform" directory will contain symlinks to the provider cache inside Pants's named cache (e.g. "tf/tf4/mod/.terraform/providers/registry.terraform.io/hashicorp/azuread/2.15.0/linux_amd64" -> "/.../.cache/pants/named_caches/terraform_plugins/{{ hash }}/registry.terraform.io/hashicorp/azuread/2.15.0/linux_amd64/"), and that's what gets pulled into the digest. But if there are multiple remote cache servers, it's possible that one server runs `init`, which populates the module cache, but the one running `validate` hasn't run `init`, so it doesn't have the module cache populated.
If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache. I think you could check whether that's the case by comparing the caches across nodes; the paths should be stable.
(also just for documentation, if there isn't a lockfile, TF needs to redownload the providers even if they are present in the module cache)
You could see if https://github.com/pantsbuild/pants/pull/22183 fixes it, but I do think that it's related to remote caching
f
> But if there are multiple remote cache servers, it's possible that one server runs `init`, which populates the module cache, but the one running `validate` hasn't run `init`, so it doesn't have the module cache populated.
Hmm, let me just clarify for myself. I have a CI runner executing the work and using a remote cache, but not remote execution; the actual execution is done on GitHub Actions runners. Do you mean remote execution, or is remote caching alone also affected? But here's what I'm wondering, regardless of remote or local cache: does a cached init step populate the Terraform provider cache, or does init always have to actually run? That would explain why it works locally: if it's cached locally then it has also been run there. But on CI with a remote cache, if the step is in the cache it may never have been run on that machine. So if the Pants cache doesn't populate the Terraform cache, then caching init at all wouldn't work?
@careful-address-89803 it's a bit of brain gymnastics doing this debugging in writing, but hopefully what I'm thinking came across somewhat 😄
> If that's the case, I think we'd need to reorganise the rules, since `init` and `validate` must be done in the same workunit if they're using the provider cache.
I think this is really getting to it; or actually, any Terraform operation that requires the providers must depend on init having been run in the same work unit
So from there I arrived at the (possibly easier) option of just never caching the init within Pants? Since the Terraform provider cache will be used to cache the provider fetches anyway, maybe init could just be made an uncacheable rule? Or would that prevent the provider cache from being kept at all?
c
I think you are correct. The problem we're seeing is that the digest after `init` contains symlinks (because that's how Terraform uses its cache), but the symlinks point into the Pants named cache, which is not replicated across machines. So if system0 runs the init, then system0 has the providers in its (local) named cache; it then pushes the digest, which contains symlinks into the named cache, to the remote cache. System1 wants to run, pulls the cached init digest down, but that digest has symlinks to providers that aren't in system1's named cache. I think you are correct that we can fix this by always running `init`, either by making it uncacheable or by combining it into the same execution as whatever command runs after it. If the providers are already downloaded in Terraform's module cache, it should just use them.
f
🙏 That sounds like a good direction then. Not sure about the pros/cons of which option to pick, but it seems very likely either of them will solve the issue
@careful-address-89803 are you already planning to look into a fix or should I give it a go?
c
I can work on it today. I think it would be simple enough to modify `TerraformProcess` to take a list of commands to run and then run them all in the launcher script
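Roughly this shape, just to illustrate the idea (a sketch, not the actual implementation):
```sh
#!/bin/sh
# Run every requested Terraform command in one sandboxed process, so `init` and
# whatever follows (validate, plan, ...) always share the same .terraform dir.
set -e
terraform init -input=false
terraform validate
```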
❤️ 1