# general
l
I’m getting an error when uploading a cache with pants, caused by pants not respecting the max gRPC message size:
[WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
If I increase the max receivable size on the server from the 4MB default, the cache starts working. Theoretically, AFAIK, the client and server should automatically agree on a max message size and honor it throughout the exchange. I tried looking at how pants uses tonic but couldn’t see anything exotic there in terms of message size. Has anyone seen anything like this happen?
f
The cache write in question is probably using the REAPI "batch" API. You can set --remote-store-batch-api-size-limit to lower the threshold above which the streaming API is used instead. The default is 4,194,304 bytes.
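For example, something like this on the command line (a sketch; the 1 MiB value and the goal/target placeholders are purely illustrative):
# Lower the batch-API threshold so anything larger goes through the streaming API
# (1 MiB here is just an example value).
pants --remote-store-batch-api-size-limit=1048576 <goal> <target>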
l
The default value for remote-store-batch-api-size-limit is 4194304 according to the docs. If that was the case, it should never be above that, right? I’m getting 5812170 in this particular error message, but I’ve already had times where I saw close to 10MB.
I tried explicitly setting remote-store-batch-api-size-limit but it seems to not limit the size
f
I forget the semantics of how that option works (and I'm the one who wrote it ...).
😄 1
The gRPC stuff could definitely stand to have a review at some point of what its semantics currently are. But someone would have to pick up work.
As I originally wrote it, it was supposed to respect the configured limit from the server.
Is your remote cache set to return the preferred value in the REAPI GetCapabilities response?
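If you want to double-check what the server is actually advertising, something like this should work (a sketch: it assumes the server exposes gRPC reflection on its gRPC port; otherwise point grpcurl at the REAPI protos with -proto/-import-path):
# Ask the cache server what it reports in GetCapabilities,
# including max_batch_total_size_bytes.
grpcurl -plaintext -d '{"instance_name": ""}' localhost:9092 \
  build.bazel.remote.execution.v2.Capabilities/GetCapabilities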
l
It was sending 0 (as if it was unlimited), but per the author’s suggestion I set it to 4194304 and rebuilt. No good. Let me try printing what comes back in self.get_capabilities to check what it’s returning
f
If the streaming API is triggering the too big error, then that would be a different issue.
l
If the streaming API is triggering the too big error, then that would be a different issue.
How can I hone in on which API is triggering the error?
f
Turn on Rust logging and enable logging in tonic. RUST_LOG should work.
at least I think so; working from memory only. Although some of that might interact with the Pants logging code, so you might also need -ldebug or -ltrace enabled. (Been a while since I last turned on Rust logging.)
l
RUST_LOG=trace adds a little more info but nothing on grpc, even with -ltrace. Any other ideas?
After struggling a little bit to get logs, I ended up just writing the output to a file so we can quickly see what’s going on. Turns out max_batch_total_size_bytes seems to be sent correctly. This is what I get when printing ServerCapabilities:
ServerCapabilities {
  cache_capabilities: Some(CacheCapabilities {
    digest_functions: [Sha256],
    action_cache_update_capabilities: Some(ActionCacheUpdateCapabilities { update_enabled: true }),
    cache_priority_capabilities: Some(PriorityCapabilities { priorities: [PriorityRange { min_priority: 0, max_priority: 0 }] }),
    max_batch_total_size_bytes: 4194304,
    symlink_absolute_path_strategy: Allowed,
    supported_compressors: [Zstd],
    supported_batch_update_compressors: [Zstd],
  }),
  execution_capabilities: None,
  deprecated_api_version: None,
  low_api_version: Some(SemVer { major: 2, minor: 0, patch: 0, prerelease: "" }),
  high_api_version: Some(SemVer { major: 2, minor: 3, patch: 0, prerelease: "" }),
}
f
Looking at the error again, I believe grpc: received message larger than max is gRPC-level config. I had suggested the REAPI-level options since they might have an indirect effect on message size. I am not sure Pants configures any message size limits in tonic itself. (https://github.com/pantsbuild/pants/tree/main/src/rust/engine/grpc_util is the relevant Pants crate.)
Might be worth seeing if tonic has a config we should be using, if we are not already using it?
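For what it’s worth, newer tonic releases do expose per-client message-size knobs on generated clients. A minimal sketch, assuming a hypothetical generated REAPI client and illustrative 16 MiB limits (this is not Pants’ actual wiring):
use tonic::transport::Channel;

// Hypothetical generated REAPI client; Pants' real clients live under the
// grpc_util crate and its generated protos, and may be wired differently.
use remote_execution::content_addressable_storage_client::ContentAddressableStorageClient;

async fn connect(addr: &'static str) -> Result<(), Box<dyn std::error::Error>> {
    let channel = Channel::from_static(addr).connect().await?;

    // tonic generated clients accept per-message size limits. Note that the
    // "received message larger than max" error in this thread comes from the
    // *server's* receive limit, so raising client-side limits alone would not fix it.
    let _client = ContentAddressableStorageClient::new(channel)
        .max_decoding_message_size(16 * 1024 * 1024)
        .max_encoding_message_size(16 * 1024 * 1024);

    Ok(())
}
Whether Pants should be setting these is a separate question; this just shows the knobs exist.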
l
https://github.com/pantsbuild/pants/pull/12537 According to this discussion, there’s nothing we need to do in tonic specifically. From what I understand of the code base, if the message is larger than the max batch size, pants should use the streaming api, right?
Tonic does have a max message size that you can set, but I think it shouldn’t be necessary if we are using the streaming api instead
f
pants should use the streaming api, right?
It should (and the logs would show ByteStream RPCs). Of course, it begs the question of whether the gRPC max message size is being exceeded by the ByteStream API stream messages.
Which remote cache server are you using?
l
I’ve been using bazel-remote. I’d love to see the RPC logs but can’t turn them on for the life of me. I’m calling pants with:
RUST_LOG=trace pants export-codegen //pkg:build -ltrace --keep-sandboxes=on_failure --no-local-cache
Am I missing anything here?
f
You need --log-show-rust-3rdparty (did a quick search to find it; I had forgotten the name).
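So the full invocation would be something like this (your command from above with the extra flag added):
RUST_LOG=trace pants export-codegen //pkg:build -ltrace --log-show-rust-3rdparty \
  --keep-sandboxes=on_failure --no-local-cache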
And you could probably enable logging on the bazel-remote side?
l
Got the pants logs here. Don’t know enough about how the remote caching works in pants, but it seems to me that it is trying to use the streaming api. But I don’t see any errors popping up. Does this give you any further clues?
f
09:51:57.86 [ERROR] 1 Exception encountered:

Engine traceback:
  in `export-codegen` goal

IntrinsicError: Failed to create symlink to ../../../../node_modules/.pnpm/@amplitude+analytics-browser@1.9.1/node_modules/@amplitude/analytics-browser at /Users/theoribeiro/Dev/trinio/trinio/dist/codegen/node/apps/checkout/node_modules/@amplitude/analytics-browser: File exists (os error 17)
an error unrelated to remote cache
and remote cache error:
09:52:00.56 [WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
weirdly the three session end writes to remote cache are all small:
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<089ee036609c36c38e6abcca42dd7e5039b7bc29b5e2051a71735a724e41b610>, size_bytes: 138 }` completed successfully
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<8a44c11b4a792d93843b654f9924f215b3c761e45852c33de4db22ba0109acc0>, size_bytes: 138 }` completed successfully
09:52:00.56 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<d686e419484f63fd28c18983ea994cf6fbe751ddb4ab55b80190698993438701>, size_bytes: 142 }` completed successfully
Can you re-run with --stats-log and provide it here? The stats will contain a histogram of the remote cache blob sizes. I am wondering if Pants is trying to write a large blob or not. (Although with the streaming API, that should be fine.) But it can help refine the inquiry into the issue.
l
an error unrelated to remote cache
Yeah, this one happens when my dist folder is not empty. Not really too big of an issue.
The only diff when running with --stats-log:
13:44:55.99 [INFO] Counters:
  backtrack_attempts: 0
  docker_execution_errors: 0
  docker_execution_requests: 0
  docker_execution_successes: 0
  local_cache_read_errors: 0
  local_cache_requests: 0
  local_cache_requests_cached: 0
  local_cache_requests_uncached: 0
  local_cache_total_time_saved_ms: 0
  local_cache_write_errors: 0
  local_execution_requests: 16
  local_process_total_time_run_ms: 52139
  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 8
  remote_cache_requests_cached: 6
  remote_cache_requests_uncached: 1
  remote_cache_speculation_local_completed_first: 1
  remote_cache_speculation_remote_completed_first: 6
  remote_cache_total_time_saved_ms: 271
  remote_cache_write_attempts: 2
  remote_cache_write_errors: 1
  remote_cache_write_successes: 1
  remote_execution_errors: 0
  remote_execution_requests: 0
  remote_execution_rpc_errors: 0
  remote_execution_rpc_execute: 0
  remote_execution_rpc_retries: 0
  remote_execution_rpc_wait_execution: 0
  remote_execution_success: 0
  remote_execution_timeouts: 0
  remote_process_total_time_run_ms: 0
  remote_store_missing_digest: 0
  remote_store_request_timeouts: 0
I’m gonna run it again now by increasing the MaxRecvMsgSize on the server to compare the outputs
f
actually you will need to add hdrhistogram to [GLOBAL].plugins in pants.toml; that will enable the histograms
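i.e. something like this (a minimal sketch of the relevant pants.toml section):
# pants.toml
[GLOBAL]
plugins = ["hdrhistogram"]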
l
Sorry, I was analyzing the log from when there is no limit on the server (setting MaxRecvMsgSize to a ridiculously high value). This is only part of the log because it got too big with the upload of all the files. You can see that 1) there are a lot more files being uploaded to the cache and 2) the size_bytes make a lot more sense.
And here’s the histogram from when the server has the default limit (when there are errors uploading the cache):
14:51:03.56 [INFO] Observation histogram summaries:
14:51:03.73 [INFO] Summary of `local_store_read_blob_size` observation histogram:
  min: 1
  max: 520959
  mean: 3861.533
  std dev: 20559.834
  total observations: 104811
  sum: 404731094
  p25: 236
  p50: 645
  p75: 1598
  p90: 5435
  p95: 11359
  p99: 64383
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
14:51:03.80 [INFO] Summary of `local_process_time_run_ms` observation histogram:
  min: 6
  max: 49183
  mean: 3122.812
  std dev: 11888.894
  total observations: 16
  sum: 49965
  p25: 22
  p50: 39
  p75: 93
  p90: 107
  p95: 166
  p99: 49183
14:51:03.86 [INFO] Summary of `remote_cache_get_action_result_time_micros` observation histogram:
  min: 11464
  max: 42079
  mean: 25393.714
  std dev: 12466.941
  total observations: 7
  sum: 177756
  p25: 13039
  p50: 18527
  p75: 36319
  p90: 40063
  p95: 42079
  p99: 42079
14:51:03.92 [INFO] Summary of `remote_cache_get_action_result_network_time_micros` observation histogram:
  min: 6852
  max: 39839
  mean: 22771.143
  std dev: 12556.225
  total observations: 7
  sum: 159398
  p25: 10415
  p50: 18159
  p75: 33151
  p90: 36991
  p95: 39839
  p99: 39839
14:51:04.06 [INFO] Summary of `local_store_read_blob_time_micros` observation histogram:
  min: 13
  max: 3938303
  mean: 2100973.634
  std dev: 889304.076
  total observations: 104811
  sum: 220205147530
  p25: 1538047
  p50: 1998847
  p75: 2609151
  p90: 3567615
  p95: 3817471
  p99: 3907583
14:51:04.07 [TRACE] deregistering event source from poller
f
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
max of 428 bytes
very strange
makes me wonder which part of the grpc stack is returning that error
can you enable trace logging in bazel-remote?
It would be good to inquire into the server side of the remote cache RPCs
and can you run wireshark to capture a packet trace of the remote cache connection?
I would like to know exactly what is transpiring between Pants and bazel-remote.
l
I’m trying to get a verbose log in bazel-remote. Looking for a flag in the code right now.
👍 1
f
(and I just need the decoded protocol log, not the actual bytes of the packet trace so no confidential data will leave your environment)
You should also be able to run tcpdump to capture a packet trace and then view it with Wireshark.
l
Alright, I just got a dump of everything going in and out of port 9092 during a pants run. There’s quite a bit of info here; anything in particular you’d want to look at?
Untitled
f
You'll need to run the trace file through Wireshark and have it decode the HTTP/2 and gRPC.
TCP isn't much help 🙂
(you can write the capture to a trace file with the -w option to tcpdump)
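For example, something like this while a pants run is in flight (a sketch: it assumes the cache’s gRPC port of 9092 and the macOS loopback interface lo0; adjust both for your setup):
# Capture remote-cache traffic to a file Wireshark can decode as HTTP/2 + gRPC.
sudo tcpdump -i lo0 -w remote-cache.pcap 'tcp port 9092'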
l
I’m trying to decode the grpc messages but wireshark isn’t recognizing the proto files
f
https://grpc.io/blog/wireshark/ seems useful documentation
l
Yep, that’s what I’m following. For some reason it’s not finding the protofiles
👍 1
f
is there a particular error at all?
you would also need the Google ByteStream gRPC protobuf
clone both repos and set the Wireshark search path to the top of the protobuf directory tree in each repo?
l
Ah, finally got the error. It’s saying google/protobuf/duration.proto wasn’t found. I cloned the protobuf repo as well and also added it to the search paths
f
yeah REAPI has lots of different protobuf definitions
l
HA! Got it
Such a mess… lol
Untitled
And a version without the TCP messages
f
Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
and annoyingly lots of
Dissector bug, protocol HTTP2: epan/dissectors/packet-http2.c:3086: failed assertion "!((pinfo)->fd->visited) && datalen == length"
l
Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
Yeah, those are the packet lengths
f
so at this point, trace logging in bazel-remote would be useful. Given that bazel-remote uses grpc-go, https://github.com/grpc/grpc-go#how-to-turn-on-logging should be relevant for getting gRPC-level logging.
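Per that link, grpc-go’s logging is controlled by environment variables, so something along these lines when starting bazel-remote should do it (a sketch; the bazel-remote flags shown are placeholders for however you normally start it):
# Verbose grpc-go logging on the server side.
GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=99 \
  bazel-remote --dir /path/to/cache --max_size 10 --grpc_port 9092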
also, for context, I just want to confirm when the ResourceExhausted is occurring.
and the "dissector bug" errors will prevent us from seeing that in the wireshark decode
l
Nothing useful in the logs
f
is bazel-remote a single process?
the thought being: is bazel-remote passing through the ResourceExhausted error from somewhere else?
l
Yeah, it’s a single process
Shouldn’t be passing anything through with this config. I could proxy it to an S3 bucket, but in this case I’m saving everything locally only.
f
I've got no ideas currently.
anyway, time for sleep given my time zone.
l
Well, thank you anyway, @fast-nail-55400! I’ll report it to bazel-remote and see if they have any ideas
Good night!
💤 1
I opened an issue on the bazel-remote repo asking whether he’d accept a PR to let the user set the max message size. But he has a point that the issue is more on the pants side, since we’re sending an oddly large message. https://github.com/buchgr/bazel-remote/issues/658