https://pantsbuild.org/ logo
#general
Title
# general
m

mysterious-waiter-14207

09/19/2022, 6:44 PM
I've been playing with remote cache for the last few days and I have two thoughts: 1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag
remote_cache
which default to true and false on some very light processes? 2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I'm thinking mainly nailgun, java processes or tests.
1
w

witty-crayon-22786

09/19/2022, 7:08 PM
1. Some processes should not be remote cached? like finding binaries on the running machine? maybe new flag
remote_cache
which default to true and false on some very light processes?
yea, this would be a useful optimization. currently these processes are marked as “per session”, but the implementation of that flag is actually to use some random information that will cause them to miss, rather than actually skipping the cache for them: see https://github.com/pantsbuild/pants/blob/f6b2832539e26ddffb19a7d665b9cef697abd0da/src/rust/engine/process_execution/src/remote.rs#L891-L903
2. Should some process not speculate (local process race with remote cache) because they are too heavy to start? I’m thinking mainly nailgun, java processes or tests.
maybe… but the heavier the process is to start, the sooner you want to try and start it if the network is going to take a while to return. the issue is more to do with cancellation being expensive: this issue discusses how to make cancellation cheaper: https://github.com/pantsbuild/pants/issues/14937
yea, this would be a useful optimization.
note though that because of speculation, this should probably not affect performance: it just causes some unnecessary network traffic during that phase
m

mysterious-waiter-14207

09/19/2022, 7:37 PM
I think it might have very small impact on performance because it takes a slot of the remote cache concurrency. It can also affect the performance of the remote cache server.
regarding the issue https://github.com/pantsbuild/pants/issues/14937: So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it? in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.
w

witty-crayon-22786

09/19/2022, 8:05 PM
So the nailgun process would continue and try to build the canceled process and only if that slot is required would kill it?
that’s right.
in heavy compilation project (we have around 900 coarsened scala targets), I think that solution might be equal to the current state, especially as the next process would start immediately, because of the speculate read.
yea, this is true. #14937 is probably more useful for local iteration, when you’re not compiling your entire graph from scratch. what the optimal strategy is depends a bit on your cache hit rate: • if you’re going to hit for all cache entries, then you didn’t need to speculate at all • if you’re going to miss for all cache entries, then speculating definitely helped
but i think that right now the cost of killing a nailgun server is not accounted for… it’s possible that nailgun processes should have a local speculation delay included on them, that would prevent the local process from starting for a configurable amount of time during remote cache speculation
m

mysterious-waiter-14207

09/19/2022, 8:09 PM
in theory, we can have an algorithm to estimate the cache hit: 1. For leafs in the DAG - assume cache hit 2. For any other node - if all dependencies were cache hit - assume cache hit
I'm not sure how to implement the delay, but I can make a PR to not speculate on nailgun processes for now?
w

witty-crayon-22786

09/19/2022, 8:13 PM
i’ll open a ticket about a delay. because “not speculating on nailgun processes” is equivalent to an infinite delay
m

mysterious-waiter-14207

09/19/2022, 8:38 PM
thanks
small pr to address the first point: https://github.com/pantsbuild/pants/pull/16920
w

witty-crayon-22786

09/19/2022, 8:46 PM
thanks!
i can add more detail to https://github.com/pantsbuild/pants/issues/16921 if that would be helpful. the
select
loop will be a little bit tricky, but not too bad
m

mysterious-waiter-14207

09/19/2022, 8:58 PM
I'm trying something, will have a draft soon, that the general idea:
Copy code
in_workunit!(
      "remote_cache_read_speculation",
      Level::Trace,
      |workunit| async move {
        tokio::select! {
          cache_result = cache_read_future => {
            self.handle_cache_read_completed(cache_result).await
          }
          s = tokio::time::sleep(tokio::time::Duration::from_millis(100)) => {
            tokio::select! {
              cache_result = cache_read_future => {
                self.handle_cache_read_completed(cache_result).await
              }
              local_result = &mut local_execution_future => {
                workunit.increment_counter(Metric::RemoteCacheSpeculationLocalCompletedFirst, 1);
                local_result.map(|res| (res, false))
              }
            }
          }          
        }
      }
    ).await
  }
w

witty-crayon-22786

09/19/2022, 9:05 PM
Yea, that should do it too.
thanks!
m

mysterious-waiter-14207

09/19/2022, 9:10 PM
almost, I'm getting
Copy code
error[E0382]: use of moved value: `cache_read_future`
   --> process_execution/src/remote_cache.rs:303:30
    |
298 |           cache_result = cache_read_future => {
    |                          ----------------- value moved here
...
303 |               cache_result = cache_read_future => {
    |                              ^^^^^^^^^^^^^^^^^ value used here after move
    |
for the second select
w

witty-crayon-22786

09/19/2022, 9:16 PM
the first one can maybe be:
Copy code
cache_result = &cache_read_future => {
m

mysterious-waiter-14207

09/19/2022, 9:18 PM
cool, that helped. Now I have:
Copy code
Compiling process_execution v0.0.1 (/Users/somdoron/git/pants/src/rust/engine/process_execution)
error[E0277]: `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future
   --> process_execution/src/remote_cache.rs:293:9
    |
293 | /         tokio::select! {
294 | |           cache_result = &cache_read_future => {
295 | |             self.handle_cache_read_completed(workunit, cache_lookup_start, cache_result, local_execution_future).await
296 | |           }
...   |
307 | |           }
308 | |         }
    | |_________^ `&Pin<Box<dyn futures::Future<Output = std::option::Option<FallibleProcessResultWithPlatform>> + std::marker::Send>>` is not a future
calling it today, will continue tomorrow
w

witty-crayon-22786

09/19/2022, 9:23 PM
thanks, have a good night!
🙏 1
m

mysterious-waiter-14207

09/19/2022, 9:39 PM
ahh, amazing! I will add the configuration part from the process tomorrow, that should be easy.
w

witty-crayon-22786

09/19/2022, 9:39 PM
thank you!
m

mysterious-waiter-14207

09/21/2022, 9:03 PM
I'm getting very good result with this, running without local cache or pantsd, fully cached build from remote cache in 25 seconds.
❤️ 2
🎉 1
Copy code
remote_cache_read_errors: 0
  remote_cache_requests: 2129
  remote_cache_requests_cached: 2112
  remote_cache_requests_uncached: 2
  remote_cache_speculation_local_completed_first: 15
  remote_cache_speculation_remote_completed_first: 2112
  remote_cache_total_time_saved_ms: 8588882
  remote_cache_write_attempts: 17
  remote_cache_write_errors: 0
  remote_cache_write_successes: 17
w

witty-crayon-22786

09/21/2022, 9:16 PM
Awesome to hear: thanks a lot for the patches to make it happen!
🙏 1
b

busy-vase-39202

09/21/2022, 9:58 PM
Woot!
2 Views