# general
e
I'm curious if anyone has seen this issue before. I've enabled the experimental GitHub Actions remote cache in my CI, and it works great, sometimes. Everything works as expected when I run my lint/check/package/test pipeline (the test step only runs unit tests). I run this on the
cruizba/ubuntu-dind:noble-27.1.1
docker image. However, my other CI job, for the integration tests, has some different constraints since it needs to run a very large docker compose stack. I run this "raw" on the GitHub runner (an ARC runner provided by my organization). Running directly on the runner, as opposed to inside a container, avoids dealing with complicated docker-in-docker issues. Anyway, I get a number of 403 errors when trying to access the cache in this job, and I'm having a hard time debugging it. Details in thread.
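For context, the two jobs are roughly this shape (the runner label, steps, and targets here are placeholders, not the real workflow):
Copy code
jobs:
  # Lint/check/package/unit-test job: runs inside the dind container image,
  # where the remote cache behaves as expected.
  checks:
    runs-on: arc-runner-set                     # placeholder ARC runner label
    container: cruizba/ubuntu-dind:noble-27.1.1
    steps:
      - uses: actions/checkout@v4
      - run: pants lint check package test ::

  # Integration tests: run directly on the runner (no container) so the large
  # docker compose stack avoids docker-in-docker; this is the job seeing 403s.
  integration-tests:
    runs-on: arc-runner-set                     # placeholder ARC runner label
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d               # placeholder for the compose stack
      - run: pants test tests/integration::     # placeholder target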
The cache stats logged by Pants during the integration test job show that the cache is working some of the time:
Copy code
21:51:06.14 [INFO] Counters:
  backtrack_attempts: 0
  docker_execution_errors: 0
  docker_execution_requests: 0
  docker_execution_successes: 0
  local_cache_read_errors: 0
  local_cache_requests: 112
  local_cache_requests_cached: 0
  local_cache_requests_uncached: 112
  local_cache_total_time_saved_ms: 0
  local_cache_write_errors: 0
  local_execution_requests: 112
  local_process_total_time_run_ms: 16083220
  remote_cache_read_errors: 39
  remote_cache_request_timeouts: 0
  remote_cache_requests: 66
  remote_cache_requests_cached: 0
  remote_cache_requests_uncached: 23
  remote_cache_speculation_local_completed_first: 4
  remote_cache_speculation_remote_completed_first: 0
  remote_cache_total_time_saved_ms: 0
  remote_cache_write_attempts: 66
  remote_cache_write_errors: 23
  remote_cache_write_successes: 43
  remote_execution_errors: 0
  remote_execution_requests: 0
  remote_execution_rpc_errors: 0
  remote_execution_rpc_execute: 0
  remote_execution_rpc_retries: 0
  remote_execution_rpc_wait_execution: 0
  remote_execution_success: 0
  remote_execution_timeouts: 0
  remote_process_total_time_run_ms: 0
  remote_store_exists_attempts: 770
  remote_store_exists_errors: 88
  remote_store_exists_successes: 317
  remote_store_missing_digest: 0
  remote_store_read_attempts: 0
  remote_store_read_cached: 0
  remote_store_read_errors: 0
  remote_store_read_uncached: 0
  remote_store_request_timeouts: 0
  remote_store_write_attempts: 212
  remote_store_write_errors: 0
  remote_store_write_successes: 212
I have a number of warning logs like:
Copy code
22:28:52.35 [WARN] Failed to write to remote cache (1 occurrences so far): failed to query pants_ci_cache/byte-store/21/c7/21c741a7e0a9441673601c3b947bea357d5754824234584970308bf78375e3d3: PermissionDenied (persistent) at stat, context: { uri: <https://ghe-actions-prod-qhqyjglk.s3.amazonaws.com/actions-69c8c8939b70/9bb02394953d4d45a28a6ccad6554933/5b57473ef36b141086d40095c0eef846?AWSAccessKeyId=AKIAQ3EGVTWOBBE5CS57&Expires=1733527732&Signature=24h6W5N7uX%2FJLKZ%2BQTHr7CtJhbY%3D>, response: Parts { status: 403, version: HTTP/1.1, headers: {"x-amz-request-id": "50E4VFZEM4E4GT66", "x-amz-id-2": "7MwRomL34rFYOzzquN0+SG2N0e1kqCbR32VDqdC5Z6aJK5EOpnmF9J3g9iRX/3TrRbElYhRhdTY=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 06 Dec 2024 22:28:51 GMT", "server": "AmazonS3"} }, service: ghac, path: pants_ci_cache/byte-store/21/c7/21c741a7e0a9441673601c3b947bea357d5754824234584970308bf78375e3d3 }
I am setting up the environment variables as recommended on https://www.pantsbuild.org/prerelease/docs/using-pants/remote-caching-and-execution/remote-caching#workflow, and I can see from the job logs that these variables have been set in the runner (roughly the wiring sketched below).
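Approximately this, going from memory of that docs page, so treat it as a sketch: the PANTS_* variables are the usual env-var mappings of the remote_* options, and the github-script step is one way to copy ACTIONS_CACHE_URL / ACTIONS_RUNTIME_TOKEN (which aren't visible to plain run steps) into variables Pants reads. In particular, PANTS_REMOTE_OAUTH_BEARER_TOKEN here is my recollection of the documented token wiring, so double-check it against the docs page:
Copy code
# inside the job definition:
env:
  PANTS_REMOTE_PROVIDER: experimental-github-actions-cache
  PANTS_REMOTE_CACHE_READ: "true"
  PANTS_REMOTE_CACHE_WRITE: "true"
steps:
  # The Actions cache URL and runtime token are only exposed to JS actions,
  # so use a github-script step to export them for later steps.
  - name: Configure Pants remote cache
    uses: actions/github-script@v7
    with:
      script: |
        core.exportVariable('PANTS_REMOTE_STORE_ADDRESS', process.env.ACTIONS_CACHE_URL);
        core.exportVariable('PANTS_REMOTE_OAUTH_BEARER_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN);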
My gut says GitHub is handing me a token with different permissions depending on the job, but that feels a bit silly. Has anyone seen anything like this before?
f
You are likely getting rate limited by GitHub
e
It is an enterprise server (i.e. not public GitHub), and we were told that there are no rate limits in place. I'll follow up with the GitHub team to confirm this though.
f
Interesting, yeah, that would have been my guess for why it works intermittently, because rate limits from them can result in both 403 and 429 responses.
e
At a guess, it looks like it's the AWS S3 bucket itself raising the errors. Possibly our GitHub instance has no rate limit, but the underlying S3 bucket does. Not sure exactly how all the cache stuff is architected behind the scenes.
I did some googling and it seems that AWS uses a 503 error for rate limiting (so that it counts as a server error and you don't get billed for 4XX client requests while under a DoS attack). I did notice that the (fairly small) number of "read" errors have a 400 code, while the write errors have a 403 code. I am also curious about the stats output: "remote cache" vs "remote store"
f
Maybe run with
-ldebug
to get some remote cache logs
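e.g. something like this (placeholder target; setting PANTS_LEVEL=debug in the job's env does the same thing):
Copy code
pants -ldebug test tests/integration::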
e
Just had a run with debug logging finish up. It looks like there aren't any new logs related to the remote caching. 😞 It shows which digests are included in each process, so I suppose I could debug cache misses vs hits, but there is nothing else about the remote HTTP calls
f
You should see logs like
Copy code
Remote cache lookup for:
and
Copy code
[DEBUG] Using [remote_cache::CommandRunner
or something along those lines; in my experience they log request errors towards the store as well
e
I do see this line:
Copy code
13:04:47.08 [DEBUG] Using [cache::CommandRunner { inner: remote_cache::CommandRunner { inner: bounded::CommandRunner { inner: SwitchedCommandRunner { .. }, .. }, .. }, .. }, cache::CommandRunner { inner: remote_cache::CommandRunner { inner: bounded::CommandRunner { inner: SwitchedCommandRunner { .. }, .. }, .. }, .. }] for process execution.
just once near the start while initializing scheduler/pantsd/etc. I do also see the
Remote cache lookup for: <task>
lines, but the only lines relating to the HTTP errors are the warning-level logs:
Copy code
2024-12-07T12:50:42.4182878Z 12:50:42.41 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read pants_ci_cache/action-cache/ca/4b/ca4be85b2bd62dd89d13365599166268394a755205426cd5ae6a5a4a16a0a9a7: Unexpected (persistent) at read, context: { uri: <https://ghe-actions-prod-qhqyjglk.s3.amazonaws.com/actions-69c8c8939b70/9bb02394953d4d45a28a6ccad6554933/2287473ef36b141086d40095c0eef846?AWSAccessKeyId=AKIAQ3EGVTWOBBE5CS57&Expires=1733579442&Signature=kJL4aUwFQOjU1v2b9zDTS3IlV8I%3D>, response: Parts { status: 400, version: HTTP/1.1, headers: {"x-amz-request-id": "XFDEE0WMTXQ7AK17", "x-amz-id-2": "rhAXrDD1An2kfE87JPAHRYIPbPtBvTLQi/yQCDTVEKdvw51WNUHS3PllWt/MeBDNgJXuMYKBe/M=", "x-amz-region": "us-east-1", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Sat, 07 Dec 2024 12:50:41 GMT", "connection": "close", "server": "AmazonS3"} }, service: ghac, path: pants_ci_cache/action-cache/ca/4b/ca4be85b2bd62dd89d13365599166268394a755205426cd5ae6a5a4a16a0a9a7, range: 0- } => <?xml version="1.0" encoding="UTF-8"?>
Skimming through the Pants source, it looks like there are logs for hits and misses, but if the error happens at the HTTP call itself, that warning is the only log
f
What are you using as the target for the remote cache in Pants? As in, what do you have between S3 and Pants?
We use self-hosted GitHub runners, for example, but we use bazel-remote-cache as the handler of remote caching from Pants via REAPI, and behind that we have blob storage where things are actually stored
e
PANTS_REMOTE_STORE_ADDRESS: <https://github.software.gevernova.com/_services/artifactcache/ocrHE8ZScyxq8pTiT1IyKRHNwdDtYL7HOvJRB2oXuk7JCGeFXS/>
It's going through the experimental-github-actions-cache, but beyond that, I don't know. (The org is migrating us from GitLab to an enterprise GitHub instance, so I'm rewriting all our CI for GitHub.)
We didn't use caching before, but since I saw that the GitHub Actions cache can be used, which avoids the need for any other kind of REAPI infrastructure, it was my choice for "probably easiest"
f
I have some suspicion that going via the github-actions-cache may make you subject to iffy limitations or weird issues. I would probably recommend looking at booting up something like bazel-remote https://github.com/buchgr/bazel-remote, having that use S3 as the storage backend, and then telling Pants to talk to that instead of the github-actions-cache. I think that is more reliable currently
We have that exact setup but with GCS instead of S3 and it works really well for a decently large setup
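A minimal sketch of that shape, assuming bazel-remote's Docker image and its default gRPC port; the s3.* flag names are from memory of the bazel-remote README and the endpoint/bucket values are placeholders, so double-check before using:
Copy code
# docker-compose.yml for the cache service (sketch)
services:
  bazel-remote:
    image: buchgr/bazel-remote-cache
    command:
      - --dir=/data
      - --max_size=50                           # GiB of local disk cache
      - --s3.endpoint=s3.us-east-1.amazonaws.com
      - --s3.bucket=my-pants-cache              # placeholder bucket
      - --s3.auth_method=iam_role
    ports:
      - "9092:9092"                             # gRPC (REAPI) endpoint
    volumes:
      - ./cache-data:/data
Pants would then point at it over gRPC instead of the github-actions-cache, e.g. PANTS_REMOTE_PROVIDER: reapi and PANTS_REMOTE_STORE_ADDRESS: grpc://<cache-host>:9092.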
e
I suppose that implies I need to go bug my devops team to run
bazel-remote-cache
up in the cloud somewhere then haha.
Thanks for all the support. It's been really helpful
f
Haha, yeah exactly 🙂
e
I just noticed this log:
Copy code
2024-12-07T15:52:18.4765380Z <Error><Code>InvalidArgument</Code><Message>Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>null</ArgumentValue><RequestId>ATSBDNV56PRJE6D2</RequestId><HostId>wRkPUK3v7M8Njes5T/+JNBxYU1LD7odDa2fEXG1uGXAuM+Zw/97NPxSma8ELujp6uYY9ulbJP+8=</HostId></Error>
that is showing up as a new line, but always immediately after the
Failed to read
warning lines. This looks much more like an actionable error message. It seems to imply that the token being provided by GitHub doesn't have the remote read permissions. I've been using
actions/cache
already and it is working, and these are supposed to be the same URLs, so maybe I can figure it out from there. I probably need to update an AWS SDK version somewhere or something
Going to repost as a new thread, since most of what's in this one seems to be irrelevant to this issue, I think