I've been playing with remote cache for the last f...
# general
m
I've been playing with remote cache for the last few days and I have two thoughts: 1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag
remote_cache
which default to true and false on some very light processes? 2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I'm thinking mainly nailgun, java processes or tests.
1
w
1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag
remote_cache
which default to true and false on some very light processes?
yea, this would be a useful optimization. currently these processes are marked as “per session”, but the implementation of that flag is actually to use some random information that will cause them to miss, rather than actually skipping the cache for them: see https://github.com/pantsbuild/pants/blob/f6b2832539e26ddffb19a7d665b9cef697abd0da/src/rust/engine/process_execution/src/remote.rs#L891-L903
2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I’m thinking mainly nailgun, java processes or tests.
maybe… but the heavier the process is to start, the sooner you want to try and start it if the network is going to take a while to return. the issue is more to do with cancellation being expensive: this issue discusses how to make cancellation cheaper: https://github.com/pantsbuild/pants/issues/14937
yea, this would be a useful optimization.
note though that because of speculation, this should probably not affect performance: it just causes some unnecessary network traffic during that phase
m
I think it might have very small impact on performance because it takes a slot of the remote cache concurrency. It can also affect the performance of the remote cache server.
regarding the issue https://github.com/pantsbuild/pants/issues/14937: So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it? in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.
w
So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it?
that’s right.
in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.
yea, this is true. #14937 is probably more useful for local iteration, when you’re not compiling your entire graph from scratch. what the optimal strategy is depends a bit on your cache hit rate: • if you’re going to hit for all cache entries, then you didn’t need to speculate at all • if you’re going to miss for all cache entries, then speculating definitely helped
but i think that right now the cost of killing a nailgun server is not accounted for… it’s possible that nailgun processes should have a local speculation delay included on them, that would prevent the local process from starting for a configurable amount of time during remote cache speculation
m
in theory, we can have an algorithm to estimate the cache hit: 1. For leafs in the DAG - assume cache hit 2. For any other node - if all dependencies were cache hit - assume cache hit
I'm not sure how to implement the delay, but I can make a PR to not speculate on nailgun processes for now?
w
i’ll open a ticket about a delay. because “not speculating on nailgun processes” is equivalent to an infinite delay
m
thanks
small pr to address the first point: https://github.com/pantsbuild/pants/pull/16920
w
thanks!
i can add more detail to https://github.com/pantsbuild/pants/issues/16921 if that would be helpful. the
select
loop will be a little bit tricky, but not too bad
m
I'm trying something, will have a draft soon, that the general idea:
Copy code
in_workunit!(
      "remote_cache_read_speculation",
      Level::Trace,
      |workunit| async move {
        tokio::select! {
          cache_result = cache_read_future => {
            self.handle_cache_read_completed(cache_result).await
          }
          s = tokio::time::sleep(tokio::time::Duration::from_millis(100)) => {
            tokio::select! {
              cache_result = cache_read_future => {
                self.handle_cache_read_completed(cache_result).await
              }
              local_result = &mut local_execution_future => {
                workunit.increment_counter(Metric::RemoteCacheSpeculationLocalCompletedFirst, 1);
                local_result.map(|res| (res, false))
              }
            }
          }          
        }
      }
    ).await
  }
w
Yea, that should do it too.
thanks!
m
almost, I'm getting
Copy code
error[E0382]: use of moved value: `cache_read_future`
   --> process_execution/src/remote_cache.rs:303:30
    |
298 |           cache_result = cache_read_future => {
    |                          ----------------- value moved here
...
303 |               cache_result = cache_read_future => {
    |                              ^^^^^^^^^^^^^^^^^ value used here after move
    |
for the second select
w
the first one can maybe be:
Copy code
cache_result = &cache_read_future => {
m
cool, that helped. Now I have:
Copy code
Compiling process_execution v0.0.1 (/Users/somdoron/git/pants/src/rust/engine/process_execution)
error[E0277]: `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future
   --> process_execution/src/remote_cache.rs:293:9
    |
293 | /         tokio::select! {
294 | |           cache_result = &cache_read_future => {
295 | |             self.handle_cache_read_completed(workunit, cache_lookup_start, cache_result, local_execution_future).await
296 | |           }
...   |
307 | |           }
308 | |         }
    | |_________^ `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future
calling it today, will continue tomorrow
w
thanks, have a good night!
🙏 1
m
ahh, amazing! I will add the configuration part from the process tomorrow, that should be easy.
w
thank you!
m
I'm getting very good result with this, running without local cache or pantsd, fully cached build from remote cache in 25 seconds.
❤️ 2
🎉 1
Copy code
remote_cache_read_errors: 0
  remote_cache_requests: 2129
  remote_cache_requests_cached: 2112
  remote_cache_requests_uncached: 2
  remote_cache_speculation_local_completed_first: 15
  remote_cache_speculation_remote_completed_first: 2112
  remote_cache_total_time_saved_ms: 8588882
  remote_cache_write_attempts: 17
  remote_cache_write_errors: 0
  remote_cache_write_successes: 17
w
Awesome to hear: thanks a lot for the patches to make it happen!
🙏 1
b
Woot!