I ve been playing with remote cache for the last few days an Pants #general

I've been playing with remote cache for the last f...

mysterious-waiter-14207

09/19/2022, 6:44 PM

I've been playing with remote cache for the last few days and I have two thoughts: 1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag

remote_cache

which default to true and false on some very light processes? 2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I'm thinking mainly nailgun, java processes or tests.

✅ 1

witty-crayon-22786

09/19/2022, 7:08 PM

1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag
remote_cache
which default to true and false on some very light processes?

yea, this would be a useful optimization. currently these processes are marked as “per session”, but the implementation of that flag is actually to use some random information that will cause them to miss, rather than actually skipping the cache for them: see https://github.com/pantsbuild/pants/blob/f6b2832539e26ddffb19a7d665b9cef697abd0da/src/rust/engine/process_execution/src/remote.rs#L891-L903

witty-crayon-22786

09/19/2022, 7:10 PM

2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I’m thinking mainly nailgun, java processes or tests.

maybe… but the heavier the process is to start, the sooner you want to try and start it if the network is going to take a while to return. the issue is more to do with cancellation being expensive: this issue discusses how to make cancellation cheaper: https://github.com/pantsbuild/pants/issues/14937

witty-crayon-22786

09/19/2022, 7:11 PM

yea, this would be a useful optimization.

note though that because of speculation, this should probably not affect performance: it just causes some unnecessary network traffic during that phase

mysterious-waiter-14207

09/19/2022, 7:37 PM

I think it might have very small impact on performance because it takes a slot of the remote cache concurrency. It can also affect the performance of the remote cache server.

mysterious-waiter-14207

09/19/2022, 7:45 PM

regarding the issue https://github.com/pantsbuild/pants/issues/14937: So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it? in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.

witty-crayon-22786

09/19/2022, 8:05 PM

So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it?

that’s right.

in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.

yea, this is true. #14937 is probably more useful for local iteration, when you’re not compiling your entire graph from scratch. what the optimal strategy is depends a bit on your cache hit rate: • if you’re going to hit for all cache entries, then you didn’t need to speculate at all • if you’re going to miss for all cache entries, then speculating definitely helped

witty-crayon-22786

09/19/2022, 8:08 PM

but i think that right now the cost of killing a nailgun server is not accounted for… it’s possible that nailgun processes should have a local speculation delay included on them, that would prevent the local process from starting for a configurable amount of time during remote cache speculation

mysterious-waiter-14207

09/19/2022, 8:09 PM

in theory, we can have an algorithm to estimate the cache hit: 1. For leafs in the DAG - assume cache hit 2. For any other node - if all dependencies were cache hit - assume cache hit

mysterious-waiter-14207

09/19/2022, 8:11 PM

I'm not sure how to implement the delay, but I can make a PR to not speculate on nailgun processes for now?

witty-crayon-22786

09/19/2022, 8:13 PM

i’ll open a ticket about a delay. because “not speculating on nailgun processes” is equivalent to an infinite delay

witty-crayon-22786

09/19/2022, 8:32 PM

filed: https://github.com/pantsbuild/pants/issues/16921

mysterious-waiter-14207

09/19/2022, 8:38 PM

thanks

mysterious-waiter-14207

09/19/2022, 8:39 PM

small pr to address the first point: https://github.com/pantsbuild/pants/pull/16920

witty-crayon-22786

09/19/2022, 8:46 PM

thanks!

witty-crayon-22786

09/19/2022, 8:47 PM

i can add more detail to https://github.com/pantsbuild/pants/issues/16921 if that would be helpful. the

select

loop will be a little bit tricky, but not too bad

mysterious-waiter-14207

09/19/2022, 8:58 PM

I'm trying something, will have a draft soon, that the general idea:

Copy code

in_workunit!(
      "remote_cache_read_speculation",
      Level::Trace,
      |workunit| async move {
        tokio::select! {
          cache_result = cache_read_future => {
            self.handle_cache_read_completed(cache_result).await
          }
          s = tokio::time::sleep(tokio::time::Duration::from_millis(100)) => {
            tokio::select! {
              cache_result = cache_read_future => {
                self.handle_cache_read_completed(cache_result).await
              }
              local_result = &mut local_execution_future => {
                workunit.increment_counter(Metric::RemoteCacheSpeculationLocalCompletedFirst, 1);
                local_result.map(|res| (res, false))
              }
            }
          }          
        }
      }
    ).await
  }

witty-crayon-22786

09/19/2022, 9:05 PM

Yea, that should do it too.

witty-crayon-22786

09/19/2022, 9:05 PM

thanks!

mysterious-waiter-14207

09/19/2022, 9:10 PM

almost, I'm getting

Copy code

error[E0382]: use of moved value: `cache_read_future`
   --> process_execution/src/remote_cache.rs:303:30
    |
298 |           cache_result = cache_read_future => {
    |                          ----------------- value moved here
...
303 |               cache_result = cache_read_future => {
    |                              ^^^^^^^^^^^^^^^^^ value used here after move
    |

mysterious-waiter-14207

09/19/2022, 9:10 PM

for the second select

witty-crayon-22786

09/19/2022, 9:16 PM

the first one can maybe be:

Copy code

cache_result = &cache_read_future => {

mysterious-waiter-14207

09/19/2022, 9:18 PM

cool, that helped. Now I have:

Copy code

Compiling process_execution v0.0.1 (/Users/somdoron/git/pants/src/rust/engine/process_execution)
error[E0277]: `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future
   --> process_execution/src/remote_cache.rs:293:9
    |
293 | /         tokio::select! {
294 | |           cache_result = &cache_read_future => {
295 | |             self.handle_cache_read_completed(workunit, cache_lookup_start, cache_result, local_execution_future).await
296 | |           }
...   |
307 | |           }
308 | |         }
    | |_________^ `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future

mysterious-waiter-14207

09/19/2022, 9:22 PM

I opened a draft here: https://github.com/pantsbuild/pants/pull/16922

mysterious-waiter-14207

09/19/2022, 9:23 PM

calling it today, will continue tomorrow

witty-crayon-22786

09/19/2022, 9:23 PM

thanks, have a good night!

🙏 1

witty-crayon-22786

09/19/2022, 9:35 PM

pushed a fix: needed to be

&mut

: https://github.com/pantsbuild/pants/pull/16922/commits/36b5c0f7f19c085dddcc0980dc0371d00f2d8b69

mysterious-waiter-14207

09/19/2022, 9:39 PM

ahh, amazing! I will add the configuration part from the process tomorrow, that should be easy.

witty-crayon-22786

09/19/2022, 9:39 PM

thank you!

mysterious-waiter-14207

09/21/2022, 9:03 PM

I'm getting very good result with this, running without local cache or pantsd, fully cached build from remote cache in 25 seconds.

❤️ 2

🎉 1

mysterious-waiter-14207

09/21/2022, 9:04 PM

Copy code

remote_cache_read_errors: 0
  remote_cache_requests: 2129
  remote_cache_requests_cached: 2112
  remote_cache_requests_uncached: 2
  remote_cache_speculation_local_completed_first: 15
  remote_cache_speculation_remote_completed_first: 2112
  remote_cache_total_time_saved_ms: 8588882
  remote_cache_write_attempts: 17
  remote_cache_write_errors: 0
  remote_cache_write_successes: 17

witty-crayon-22786

09/21/2022, 9:16 PM

Awesome to hear: thanks a lot for the patches to make it happen!

🙏 1

busy-vase-39202

09/21/2022, 9:58 PM

Woot!

2 Views

Open in Slack

Previous Next