# general
l
I’m getting an error when uploading a cache with pants, caused by pants not respecting the max gRPC message size:
[WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
If I increase the max receivable size on the server from the 4MB default, the cache starts working. Theoretically, AFAIK, the client and server should automatically agree on a max message size and honor it throughout the exchange. I tried looking at how pants uses tonic but couldn’t see anything exotic there in terms of message size. Has anyone seen anything like this happen?
f
The cache write in question is probably using the REAPI "batch" API. You can set --remote-store-batch-api-size-limit to lower the threshold above which the streaming API is used instead. The default is 4,194,304 bytes.
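For example, something like this on the command line (a sketch; the 1 MiB value and the goal/target placeholders are purely illustrative):
# Lower the batch-API threshold so anything larger goes through the streaming API
# (1 MiB here is just an example value).
pants --remote-store-batch-api-size-limit=1048576 <goal> <target>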
l
The default value for remote-store-batch-api-size-limit is 4194304 according to the docs. If that was the case, it should never be above that, right? I’m getting 5812170 in this particular error message, but I’ve already had times where I saw close to 10MB.
I tried explicitly setting remote-store-batch-api-size-limit but it seems to not limit the size
f
I forget the semantics of how that option works (and I'm the one who wrote it ...).
😄 1
The gRPC stuff could definitely stand to have a review at some point of what its semantics currently are. But someone would have to pick up work.
As I originally wrote it, it was supposed to respect the configured limit from the server.
Is your remote cache set to return the preferred value in the REAPI GetCapabilities response?
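If you want to double-check what the server is actually advertising, something like this should work (a sketch: it assumes the server exposes gRPC reflection on its gRPC port; otherwise point grpcurl at the REAPI protos with -proto/-import-path):
# Ask the cache server what it reports in GetCapabilities,
# including max_batch_total_size_bytes.
grpcurl -plaintext -d '{"instance_name": ""}' localhost:9092 \
  build.bazel.remote.execution.v2.Capabilities/GetCapabilities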
l
It was sending 0 (as if it was unlimited), but per the author’s suggestion I set it to 4194304 and rebuilt. No good. Let me try printing what comes back in self.get_capabilities to check what it’s returning
f
If the streaming API is triggering the too big error, then that would be a different issue.
l
If the streaming API is triggering the too big error, then that would be a different issue.
How can I hone in on which API is triggering the error?
f
Turn on Rust logging and enable logging in tonic. RUST_LOG should work.
at least I think so; working from memory only. Although some of that might interact with the Pants logging code, so you might also need -ldebug or -ltrace enabled. (Been a while since I last turned on Rust logging.)
l
RUST_LOG=trace adds a little more info but nothing on grpc, even with -ltrace. Any other ideas?
After struggling a little bit to get logs, I ended up just writing the output to a file so we can quickly see what’s going on. Turns out max_batch_total_size_bytes seems to be sent correctly. This is what I get when printing ServerCapabilities:
ServerCapabilities {
  cache_capabilities: Some(CacheCapabilities {
    digest_functions: [Sha256],
    action_cache_update_capabilities: Some(ActionCacheUpdateCapabilities { update_enabled: true }),
    cache_priority_capabilities: Some(PriorityCapabilities { priorities: [PriorityRange { min_priority: 0, max_priority: 0 }] }),
    max_batch_total_size_bytes: 4194304,
    symlink_absolute_path_strategy: Allowed,
    supported_compressors: [Zstd],
    supported_batch_update_compressors: [Zstd],
  }),
  execution_capabilities: None,
  deprecated_api_version: None,
  low_api_version: Some(SemVer { major: 2, minor: 0, patch: 0, prerelease: "" }),
  high_api_version: Some(SemVer { major: 2, minor: 3, patch: 0, prerelease: "" }),
}
f
Looking at the error again, I believe grpc: received message larger than max is gRPC-level config. I had suggested the REAPI-level options since they might have an indirect effect on message size. I am not sure Pants configures any message size limits in tonic itself. (https://github.com/pantsbuild/pants/tree/main/src/rust/engine/grpc_util is the relevant Pants crate.)
Might be worth seeing if tonic has a config we should be using, if we are not already using it?
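For what it’s worth, newer tonic releases do expose per-client message-size knobs on generated clients. A minimal sketch, assuming a hypothetical generated REAPI client and illustrative 16 MiB limits (this is not Pants’ actual wiring):
use tonic::transport::Channel;

// Hypothetical generated REAPI client; Pants' real clients live under the
// grpc_util crate and its generated protos, and may be wired differently.
use remote_execution::content_addressable_storage_client::ContentAddressableStorageClient;

async fn connect(addr: &'static str) -> Result<(), Box<dyn std::error::Error>> {
    let channel = Channel::from_static(addr).connect().await?;

    // tonic generated clients accept per-message size limits. Note that the
    // "received message larger than max" error in this thread comes from the
    // *server's* receive limit, so raising client-side limits alone would not fix it.
    let _client = ContentAddressableStorageClient::new(channel)
        .max_decoding_message_size(16 * 1024 * 1024)
        .max_encoding_message_size(16 * 1024 * 1024);

    Ok(())
}
Whether Pants should be setting these is a separate question; this just shows the knobs exist.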
l
https://github.com/pantsbuild/pants/pull/12537 According to this discussion, there’s nothing we need to do in tonic specifically. From what I understand of the code base, if the message is larger than the max batch size, pants should use the streaming api, right?
Tonic does have a max message size that you can set, but I think it shouldn’t be necessary if we are using the streaming api instead
f
pants should use the streaming api, right?
It should (and the logs would show ByteStream RPCs). Of course, it begs the question of whether the gRPC max message size is being exceeded by the ByteStream API stream messages.
Which remote cache server are you using?
l
I’ve been using bazel-remote. I’d love to see the RPC logs but can’t turn them on for the life of me. I’m calling pants with:
RUST_LOG=trace pants export-codegen //pkg:build -ltrace --keep-sandboxes=on_failure --no-local-cache
Am I missing anything here?
f
You need --log-show-rust-3rdparty (did a quick search to find it; I had forgotten the name).
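So the full invocation would be something like this (your command from above with the extra flag added):
RUST_LOG=trace pants export-codegen //pkg:build -ltrace --log-show-rust-3rdparty \
  --keep-sandboxes=on_failure --no-local-cache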
And you could probably enable logging on the bazel-remote side?
l
Got the pants logs here. Don’t know enough about how the remote caching works in pants, but it seems to me that it is trying to use the streaming api. But I don’t see any errors popping up. Does this give you any further clues?
f
09:51:57.86 [ERROR] 1 Exception encountered:

Engine traceback:
  in `export-codegen` goal

IntrinsicError: Failed to create symlink to ../../../../node_modules/.pnpm/@amplitude+analytics-browser@1.9.1/node_modules/@amplitude/analytics-browser at /Users/theoribeiro/Dev/trinio/trinio/dist/codegen/node/apps/checkout/node_modules/@amplitude/analytics-browser: File exists (os error 17)
an error unrelated to remote cache
and remote cache error:
09:52:00.56 [WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5812170 vs. 4194304)"
weirdly the three session end writes to remote cache are all small:
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<089ee036609c36c38e6abcca42dd7e5039b7bc29b5e2051a71735a724e41b610>, size_bytes: 138 }` completed successfully
09:51:57.86 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<8a44c11b4a792d93843b654f9924f215b3c761e45852c33de4db22ba0109acc0>, size_bytes: 138 }` completed successfully
09:52:00.56 [TRACE] Session end task `remote cache write Digest { hash: Fingerprint<d686e419484f63fd28c18983ea994cf6fbe751ddb4ab55b80190698993438701>, size_bytes: 142 }` completed successfully
Can you re-run with --stats-log and provide it here? The stats will contain a histogram of the remote cache blob sizes. I am wondering if Pants is trying to write a large blob or not. (Although with the streaming API, that should be fine.) But it can help refine the inquiry into the issue.
l
an error unrelated to remote cache
Yeah, this one happens when my dist folder is not empty. Not really too big of an issue.
The only diff when running with --stats-log:
13:44:55.99 [INFO] Counters:
  backtrack_attempts: 0
  docker_execution_errors: 0
  docker_execution_requests: 0
  docker_execution_successes: 0
  local_cache_read_errors: 0
  local_cache_requests: 0
  local_cache_requests_cached: 0
  local_cache_requests_uncached: 0
  local_cache_total_time_saved_ms: 0
  local_cache_write_errors: 0
  local_execution_requests: 16
  local_process_total_time_run_ms: 52139
  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 8
  remote_cache_requests_cached: 6
  remote_cache_requests_uncached: 1
  remote_cache_speculation_local_completed_first: 1
  remote_cache_speculation_remote_completed_first: 6
  remote_cache_total_time_saved_ms: 271
  remote_cache_write_attempts: 2
  remote_cache_write_errors: 1
  remote_cache_write_successes: 1
  remote_execution_errors: 0
  remote_execution_requests: 0
  remote_execution_rpc_errors: 0
  remote_execution_rpc_execute: 0
  remote_execution_rpc_retries: 0
  remote_execution_rpc_wait_execution: 0
  remote_execution_success: 0
  remote_execution_timeouts: 0
  remote_process_total_time_run_ms: 0
  remote_store_missing_digest: 0
  remote_store_request_timeouts: 0
I’m gonna run it again now by increasing the MaxRecvMsgSize on the server to compare the outputs
f
actually you will need to add hdrhistogram to [GLOBAL].plugins in pants.toml; that will enable the histograms
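i.e. something like this (a minimal sketch of the relevant pants.toml section):
# pants.toml
[GLOBAL]
plugins = ["hdrhistogram"]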
l
Sorry, I was analyzing the log from when there is no limit on the server (setting MaxRecvMsgSize to a ridiculously high value). This is only part of the log because it got too big with the upload of all the files. You can see that 1) there are a lot more files being uploaded to the cache and 2) the size_bytes make a lot more sense.
And here’s the histogram from when the server has the default limit (when there are errors uploading the cache):
14:51:03.56 [INFO] Observation histogram summaries:
14:51:03.73 [INFO] Summary of `local_store_read_blob_size` observation histogram:
  min: 1
  max: 520959
  mean: 3861.533
  std dev: 20559.834
  total observations: 104811
  sum: 404731094
  p25: 236
  p50: 645
  p75: 1598
  p90: 5435
  p95: 11359
  p99: 64383
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
14:51:03.80 [INFO] Summary of `local_process_time_run_ms` observation histogram:
  min: 6
  max: 49183
  mean: 3122.812
  std dev: 11888.894
  total observations: 16
  sum: 49965
  p25: 22
  p50: 39
  p75: 93
  p90: 107
  p95: 166
  p99: 49183
14:51:03.86 [INFO] Summary of `remote_cache_get_action_result_time_micros` observation histogram:
  min: 11464
  max: 42079
  mean: 25393.714
  std dev: 12466.941
  total observations: 7
  sum: 177756
  p25: 13039
  p50: 18527
  p75: 36319
  p90: 40063
  p95: 42079
  p99: 42079
14:51:03.92 [INFO] Summary of `remote_cache_get_action_result_network_time_micros` observation histogram:
  min: 6852
  max: 39839
  mean: 22771.143
  std dev: 12556.225
  total observations: 7
  sum: 159398
  p25: 10415
  p50: 18159
  p75: 33151
  p90: 36991
  p95: 39839
  p99: 39839
14:51:04.06 [INFO] Summary of `local_store_read_blob_time_micros` observation histogram:
  min: 13
  max: 3938303
  mean: 2100973.634
  std dev: 889304.076
  total observations: 104811
  sum: 220205147530
  p25: 1538047
  p50: 1998847
  p75: 2609151
  p90: 3567615
  p95: 3817471
  p99: 3907583
14:51:04.07 [TRACE] deregistering event source from poller
f
14:51:03.74 [INFO] Summary of `remote_store_blob_bytes_uploaded` observation histogram:
  min: 0
  max: 428
  mean: 103.370
  std dev: 98.918
  total observations: 27
  sum: 2791
  p25: 6
  p50: 125
  p75: 139
  p90: 142
  p95: 304
  p99: 428
max of 428 bytes
very strange
makes me wonder which part of the grpc stack is returning that error
can you enable trace logging in bazel-remote?
It would be good to inquire into the server side of the remote cache RPCs
and can you run wireshark to capture a packet trace of the remote cache connection?
I would like to know exactly what is transpiring between Pants and bazel-remote.
l
I’m trying to get a verbose log in bazel-remote. Looking for a flag in the code right now.
👍 1
f
(and I just need the decoded protocol log, not the actual bytes of the packet trace so no confidential data will leave your environment)
You should also be able to run tcpdump to capture a packet trace and then view it with Wireshark.
l
Alright, I just got a dump of everything going in and out of port 9092 during a pants run. There’s quite a bit of info here; anything in particular you’d want to look at?
Untitled
f
You'll need to run the trace file through Wireshark and have it decode the HTTP/2 and gRPC.
TCP isn't much help 🙂
(you can write the capture to a trace file with the -w option to tcpdump)
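For example, something like this while a pants run is in flight (a sketch: it assumes the cache’s gRPC port of 9092 and the macOS loopback interface lo0; adjust both for your setup):
# Capture remote-cache traffic to a file Wireshark can decode as HTTP/2 + gRPC.
sudo tcpdump -i lo0 -w remote-cache.pcap 'tcp port 9092'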
l
I’m trying to decode the grpc messages but wireshark isn’t recognizing the proto files
f
https://grpc.io/blog/wireshark/ seems useful documentation
l
Yep, that’s what I’m following. For some reason it’s not finding the protofiles
👍 1
f
is there a particular error at all?
you would also need the Google ByteStream gRPC protobuf
clone both repos and set the Wireshark search path to the top of the protobuf directory tree in each repo?
l
Ah, finally got the error. It’s saying google/protobuf/duration.proto wasn’t found. I cloned the protobuf repo as well and also added it to the search paths
f
yeah REAPI has lots of different protobuf definitions
l
HA! Got it
Such a mess… lol
Untitled
And a version without the TCP messages
f
Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
and annoyingly lots of
Dissector bug, protocol HTTP2: epan/dissectors/packet-http2.c:3086: failed assertion "!((pinfo)->fd->visited) && datalen == length"
l
Assuming the numbers like 91, 100, 125 are chunk sizes, nothing in the trace stands out as a huge number.
Yeah, those are the packet lengths
f
so at this point, trace logging in bazel-remote would be useful. Given that bazel-remote uses grpc-go, https://github.com/grpc/grpc-go#how-to-turn-on-logging should be relevant for getting gRPC-level logging.
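Per that link, grpc-go’s logging is controlled by environment variables, so something along these lines when starting bazel-remote should do it (a sketch; the bazel-remote flags shown are placeholders for however you normally start it):
# Verbose grpc-go logging on the server side.
GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=99 \
  bazel-remote --dir /path/to/cache --max_size 10 --grpc_port 9092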
also, for context, I just want to confirm when the ResourceExhausted is occurring.
and the "dissector bug" errors will prevent us from seeing that in the wireshark decode
l
Nothing useful in the logs
f
is bazel-remote a single process?
the thought being: is bazel-remote passing through the ResourceExhausted error from somewhere else?
l
Yeah, it’s a single process
Shouldn’t be passing anything through with this config. I could proxy it to an S3 bucket, but in this case I’m saving everything locally only.
f
I've got no ideas currently.
anyway, time for sleep given my time zone.
l
Well, thank you anyway, @fast-nail-55400! I’ll report it to bazel-remote and see if they have any ideas
Good night!
💤 1
I opened an issue on the bazel-remote repo asking whether he’d accept a PR to let the user set the max message size. But he has a point that the issue is more on the pants side, since we’re sending an oddly large message. https://github.com/buchgr/bazel-remote/issues/658