# general
I’m getting an error when uploading a cache with pants that is being caused by pants not respecting the max gRPC message size:
[WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
If I increase the max receivable size on the server from the 4MB default, the cache starts working. Theoretically, AFAIK, client and server should automatically agree on a max message size and honor that throughout the message exchange. I tried looking up how pants is using
but couldn’t see anything exotic there in terms of message size. Has anyone seen anything like that happen?
The cache write in question is probably using the REAPI "batch" API. You can set
to lower the threshold before the streaming API is used.
The default is 4,194,304 bytes.
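For context on that suggestion: a REAPI client typically compares a blob's size against the batch limit and falls back to the streaming API for anything larger. A minimal illustrative sketch (not Pants' actual code; the function name is mine):

```python
# Illustrative sketch (not Pants' actual code): small blobs go through the
# REAPI batch API, larger ones fall back to the ByteStream streaming API.

BATCH_API_SIZE_LIMIT = 4_194_304  # the 4 MiB default discussed above

def pick_upload_api(blob_size: int, batch_limit: int = BATCH_API_SIZE_LIMIT) -> str:
    return "batch" if blob_size <= batch_limit else "stream"

print(pick_upload_api(138))        # small blob from the logs -> batch
print(pick_upload_api(5_812_170))  # the failing message size -> stream
```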
The default value for remote-store-batch-api-size-limit is
according to docs. If that was the case it should never be above that, right? I’m getting
in this particular error message, but I’ve already had times where I saw close to 10MB
I tried explicitly setting
but it doesn’t seem to limit the size
I forget the semantics of how that option works (and I'm the one who wrote it ...).
The gRPC stuff could definitely stand to have a review at some point of what its semantics currently are. But someone would have to pick up the work.
As I originally wrote it, it was supposed to respect the configured limit from the server.
Is your remote cache set to return the preferred value in the REAPI
It was sending 0 (as if it was unlimited) but per the author’s suggestion I set it to 4194304 and rebuilt but no good. Let me try printing what comes back in
to check what it’s returning
If the streaming API is triggering the too big error, then that would be a different issue.
> If the streaming API is triggering the too big error, then that would be a different issue.
How can I home in on which API is triggering the error?
Turn on Rust logging and enable logging in
should work.
at least I think so. working from memory only.
although some of that might interact with the Pants logging code, so you might need
enabled. (Been a while since I had turned on Rust logging.)
adds a little more info but nothing on grpc even with
. Any other ideas?
After struggling a little bit to get logs I ended up just writing the output to a file so we can quickly see what’s going on. Turns out
seems to be sent correctly. This is what I get printing ServerCapabilities:
ServerCapabilities { cache_capabilities: Some(CacheCapabilities { digest_functions: [Sha256], action_cache_update_capabilities: Some(ActionCacheUpdateCapabilities { update_enabled: true }), cache_priority_capabilities: Some(PriorityCapabilities { priorities: [PriorityRange { min_priority: 0, max_priority: 0 }] }), max_batch_total_size_bytes: 4194304, symlink_absolute_path_strategy: Allowed, supported_compressors: [Zstd], supported_batch_update_compressors: [Zstd] }), execution_capabilities: None, deprecated_api_version: None, low_api_version: Some(SemVer { major: 2, minor: 0, patch: 0, prerelease: "" }), high_api_version: Some(SemVer { major: 2, minor: 3, patch: 0, prerelease: "" }) }
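For reference, one way a client can combine its own configured batch limit with the server's advertised `max_batch_total_size_bytes` (where 0 conventionally means no limit advertised, as bazel-remote initially returned) — a hypothetical sketch, not necessarily what Pants does internally:

```python
# Hypothetical sketch of clamping a client batch limit to the server's
# advertised ServerCapabilities.max_batch_total_size_bytes.
# Not necessarily Pants' actual behavior.

def effective_batch_limit(client_limit: int, server_max: int) -> int:
    if server_max == 0:
        # Server didn't advertise a limit (what bazel-remote returned at first).
        return client_limit
    return min(client_limit, server_max)

print(effective_batch_limit(4_194_304, 0))           # -> 4194304
print(effective_batch_limit(10_000_000, 4_194_304))  # -> 4194304
```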
Looking at the error again, I believe
grpc: received message larger than max
is gRPC level config. I had suggested the REAPI level options since it might have an indirect effect on message size. I am not sure Pants configures any message size limits in
itself. (https://github.com/pantsbuild/pants/tree/main/src/rust/engine/grpc_util is the relevant Pants crate.)
Might be worth seeing if
has a config we should be using if we are not already using it?
https://github.com/pantsbuild/pants/pull/12537 According to this discussion, there’s nothing we need to do in tonic specifically. From what I understand in the code base, if the message is larger than the max batch size, pants should use the streaming api, right?
Tonic does have a max message size that you can set, but I think it shouldn’t be necessary if we are using the streaming API instead
> pants should use the streaming api, right?
It should (and the logs would show
RPCs). Of course, it begs the question whether the gRPC max message size is being exceeded for the
API stream messages.
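For reference, the streaming API sidesteps the per-message limit by splitting a blob into chunked write messages, each well under the gRPC max. A simplified sketch (the chunk size and function name are illustrative assumptions, not Pants' actual values):

```python
# Illustrative sketch of ByteStream-style chunked upload (not Pants' code).
# The chunk size is an assumption; real clients pick various values that
# stay under the gRPC message limit.

CHUNK_SIZE = 64 * 1024

def chunk_blob(blob: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield successive slices; each would become one write message on the stream."""
    for offset in range(0, len(blob), chunk_size):
        yield blob[offset:offset + chunk_size]

blob = b"x" * 5_812_170  # the size from the failing upload above
chunks = list(chunk_blob(blob))
print(len(chunks))                                # -> 89
print(max(len(c) for c in chunks) <= CHUNK_SIZE)  # -> True
```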
Which remote cache server are you using?
I’ve been using
. I’d love to see the rpc logs but can’t turn them on for the life of me. I’m calling pants with:
RUST_LOG=trace pants export-codegen //pkg:build -ltrace --keep-sandboxes=on_failure --no-local-cache
Am I missing anything here?
You need
(did a quick search to find it, I had forgotten the name)
And you could probably enable logging on the bazel-remote side?
Got the pants logs here. Don’t know enough about how the remote caching works in pants but it seems to me that it’s trying to use the streaming API. But I don’t see any errors popping up. Does this give you any further clues?
09:51:57.86 [ERROR] 1 Exception encountered:

Engine traceback:
  in `export-codegen` goal

IntrinsicError: Failed to create symlink to ../../../../node_modules/.pnpm/@amplitude+analytics-browser@1.9.1/node_modules/@amplitude/analytics-browser at /Users/theoribeiro/Dev/trinio/trinio/dist/codegen/node/apps/checkout/node_modules/@amplitude/analytics-browser: File exists (os error 17)
an error unrelated to remote cache
and remote cache error:
09:52:00.56 [WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
weirdly the three session end writes to remote cache are all small:
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<089ee036609c36c38e6abcca42dd7e5039b7bc29b5e2051a71735a724e41b610>, size_bytes: 138 }` completed successfully
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<8a44c11b4a792d93843b654f9924f215b3c761e45852c33de4db22ba0109acc0>, size_bytes: 138 }` completed successfully
09:52:00.56 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<d686e419484f63fd28c18983ea994cf6fbe751ddb4ab55b80190698993438701>, size_bytes: 142 }` completed successfully
Can you re-run with
and provide it here? The stats will contain a histogram of the remote cache blob sizes. I am wondering if Pants is trying to write a large blob or not. (Although with the streaming API, that should be fine.) But it can help refine the inquiry into the issue.
> an error unrelated to remote cache
Yeah, this one happens when my dist folder is not empty. Not really too big of an issue.
The only diff when running with
13:44:55.99 [INFO] Counters:
  backtrack_attempts: 0
  docker_execution_errors: 0
  docker_execution_requests: 0
  docker_execution_successes: 0
  local_cache_read_errors: 0
  local_cache_requests: 0
  local_cache_requests_cached: 0
  local_cache_requests_uncached: 0
  local_cache_total_time_saved_ms: 0
  local_cache_write_errors: 0
  local_execution_requests: 16
  local_process_total_time_run_ms: 52139
  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 8
  remote_cache_requests_cached: 6
  remote_cache_requests_uncached: 1
  remote_cache_speculation_local_completed_first: 1
  remote_cache_speculation_remote_completed_first: 6
  remote_cache_total_time_saved_ms: 271
  remote_cache_write_attempts: 2
  remote_cache_write_errors: 1
  remote_cache_write_successes: 1
  remote_execution_errors: 0
  remote_execution_requests: 0
  remote_execution_rpc_errors: 0
  remote_execution_rpc_execute: 0
  remote_execution_rpc_retries: 0
  remote_execution_rpc_wait_execution: 0
  remote_execution_success: 0
  remote_execution_timeouts: 0
  remote_process_total_time_run_ms: 0
  remote_store_missing_digest: 0
  remote_store_request_timeouts: 0
I’m gonna run it again now by increasing the MaxRecvMsgSize on the server to compare the outputs
actually you will need to add
, that will enable the histograms
Sorry, I was analyzing the log when there is no limit on the server (setting MaxRecvMsgSize to a ridiculously high value). This is only part of the log bc it got too big with the upload of all files. You can see that 1) there are a lot more files being uploaded to the cache and 2) the size_bytes make a lot more sense.
And here’s the histogram from when the server has the default limit (when there are errors uploading the cache):
14:51:03.56 [INFO] Observation histogram summaries:
14:51:03.73 [INFO] Summary of `local_store_read_blob_size` observation histogram:
  min: 1
  max: 520959
  mean: 3861.533
  std dev: 20559.834
  total observations: 104811
  sum: 404731094
  p25: 236
  p50: 645
  p75: 1598
  p90: 5435
  p95: 11359
  p99: 64383
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
14:51:03.80 [INFO] Summary of `local_process_time_run_ms` observation histogram:
  min: 6
  max: 49183
  mean: 3122.812
  std dev: 11888.894
  total observations: 16
  sum: 49965
  p25: 22
  p50: 39
  p75: 93
  p90: 107
  p95: 166
  p99: 49183
14:51:03.86 [INFO] Summary of `remote_cache_get_action_result_time_micros` observation histogram:
  min: 11464
  max: 42079
  mean: 25393.714
  std dev: 12466.941
  total observations: 7
  sum: 177756
  p25: 13039
  p50: 18527
  p75: 36319
  p90: 40063
  p95: 42079
  p99: 42079
14:51:03.92 [INFO] Summary of `remote_cache_get_action_result_network_time_micros` observation histogram:
  min: 6852
  max: 39839
  mean: 22771.143
  std dev: 12556.225
  total observations: 7
  sum: 159398
  p25: 10415
  p50: 18159
  p75: 33151
  p90: 36991
  p95: 39839
  p99: 39839
14:51:04.06 [INFO] Summary of `local_store_read_blob_time_micros` observation histogram:
  min: 13
  max: 3938303
  mean: 2100973.634
  std dev: 889304.076
  total observations: 104811
  sum: 220205147530
  p25: 1538047
  p50: 1998847
  p75: 2609151
  p90: 3567615
  p95: 3817471
  p99: 3907583
14:51:04.07 [TRACE] deregistering event source from poller
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
max of 428 bytes
very strange
makes me wonder which part of the grpc stack is returning that error
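One thing worth keeping in mind while narrowing this down: the gRPC limit applies to the whole serialized message, not to an individual blob, so in principle a batch request full of tiny blobs could still be large. A rough, hypothetical calculation (the per-entry overhead figure is a guess) shows the observed uploads don't come close:

```python
# Back-of-the-envelope sketch: the gRPC limit is on the serialized message,
# not on a single blob, so a batched request is roughly the sum of its blobs
# plus per-entry overhead. The overhead constant is a rough guess.

PER_ENTRY_OVERHEAD = 80  # digest hash, size field, proto framing -- approximate

def approx_batch_request_size(blob_sizes) -> int:
    return sum(size + PER_ENTRY_OVERHEAD for size in blob_sizes)

# Histogram above: 27 uploads, max 428 bytes each -- even combined into one
# request this is nowhere near the 5,812,170-byte message in the error:
print(approx_batch_request_size([428] * 27))  # -> 13716
```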
can you enable trace logging in bazel-remote?
It would be good to inquire into the server side of the remote cache RPCs
and can you run wireshark to capture a packet trace of the remote cache connection?
I would like to know exactly what is transpiring between Pants and bazel-remote.
I’m trying to get a verbose log in bazel-remote. Looking for a flag in the code right now.
(and I just need the decoded protocol log, not the actual bytes of the packet trace so no confidential data will leave your environment)
You should also be able to run
to capture a packet trace and then view it with Wireshark.
Alright, I just got a dump of everything going in and out of port 9092 during a pants run. There’s quite a bit of info here, anything in particular you’d want to look at?
You'll need to run the trace file through Wireshark and have it decode the HTTP/2 and gRPC.
TCP isn't much help 🙂
(you can output a trace file with the
option to
I’m trying to decode the gRPC messages but Wireshark isn’t recognizing the proto files
https://grpc.io/blog/wireshark/ seems like useful documentation
Yep, that’s what I’m following. For some reason it’s not finding the proto files
is there a particular error at all?
you would also need the Google
gRPC protobuf
clone both repos and set the Wireshark search path to the top of each protobuf directory tree in each repo?
Ah, finally got the error. It’s saying google/protobuf/duration.proto wasn’t found. I cloned the protobuf repo as well and also added it to the search paths
yeah REAPI has lots of different protobuf definitions
HA! Got it
Such a mess… lol
And a version without the TCP messages
Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
and annoyingly lots of
Dissector bug, protocol HTTP2: epan/dissectors/packet-http2.c:3086: failed assertion "!((pinfo)->fd->visited) && datalen == length"
> Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
Yeah, those are the packet lengths
so at this point, trace logging in bazel-remote would be useful. given bazel-remote uses grpc-go, https://github.com/grpc/grpc-go#how-to-turn-on-logging should be relevant to get gRPC level logging
also, for context, I just want to confirm when the
is occurring.
and the "dissector bug" errors will prevent us from seeing that in the wireshark decode
Nothing useful in the logs
is bazel-remote a single process?
the thought being: is bazel-remote passing through the
error from somewhere else?
Yeah, it’s a single process
Shouldn’t be passing anything through with this config. I could proxy it to an s3 bucket but in this case I’m saving everything locally only
I've got no ideas currently.
anyway, time for sleep given my time zone.
Well, thank you anyways @fast-nail-55400 ! I’ll report to bazel-remote and see if they have any ideas
Good night!
I opened an issue on the bazel-remote repo asking if he’d be accepting a PR to let the user set the max message size. But he has a point that the issue is more on the pants side since we’re sending an oddly large message. https://github.com/buchgr/bazel-remote/issues/658