Is this how we’ll implement retries on the engine ...
# development
h
Is this how we’ll implement retries on the engine side? Add a new attribute to `ExecuteProcessRequest` called `max_number_attempts`. When the `ExecuteProcessResult` has an exit code != 0, have the engine retry up to n times? Would we surface any mechanism for retry predicates, i.e. only retry if condition x is met? Or could we keep it simple and retry whenever the return code != 0?
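A rough sketch of what the proposed attribute could look like on the request datatype. This is illustrative only: the fields shown are a simplified stand-in for the real `ExecuteProcessRequest`, and `max_number_attempts` does not exist today.

```python
# Hypothetical sketch of the proposal above; not real pants code.
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExecuteProcessRequest:
  # Simplified subset of the real request fields, for illustration only.
  argv: Tuple[str, ...]
  env: Dict[str, str]
  description: str
  # Proposed new field: the engine would rerun the process up to this many times
  # whenever the exit code is != 0, returning the first success (or the last
  # failure once attempts are exhausted).
  max_number_attempts: int = 1
```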
a
so for python run instant reload we obv can just use `--loop` since we just need pants to manage restarting it, and we're not using that as part of another rule
h
(Context of this is CI pytest runner)
a
i'm not sure what it would be for jupyter notebooks since i'm not sure (as in, it could be true or false) whether we could service everything jupyter needs from pants performantly as a `@console_rule` invocation
oh well we can already do retries like you describe by just requesting `FallibleExecuteProcessResult` instead
but if a persistent connection is required, i would probably do what i'm doing with my bloop compile work right now and just have my rule loop over some input stream
we can absolutely implement retries on the engine side though -- i'm thinking that's something we could do with another type of `CommandRunner` (cc @early-needle-54791), although i don't know if that's the right level of abstraction to do it at
would depend on whether we care about things like the output from each failed attempt, or whether we just want to send back the result of the first successful request
i think "flaky process execution" or "flaky url fetch" sounds like something we might want to be able to represent in the engine in a generic way (e.g., off the top of my head, adding an intrinsic which accepts a dataclass containing a process execution request + "flakiness metadata"), which might then be applicable to the ci pytest runner
i think we can do all of that in python v2 rules though
and if we can, it's better to do that
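One possible shape for the "flakiness metadata" idea floated above, purely as a sketch; none of these names exist in the engine, and the wrapper type is hypothetical.

```python
# Illustrative only: a wrapper pairing a process execution request with a retry
# policy, roughly matching the "process execution request + flakiness metadata"
# idea above. Not part of the pants engine.
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FlakinessMetadata:
  max_attempts: int = 3
  # Exit codes worth retrying; an empty set could mean "retry any nonzero exit".
  retryable_exit_codes: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class FlakyExecuteProcessRequest:
  request: "ExecuteProcessRequest"  # the underlying process execution request
  flakiness: FlakinessMetadata = FlakinessMetadata()
```

An intrinsic (or a plain `@rule`) could then accept a `FlakyExecuteProcessRequest` and decide how many times to re-run the underlying request before giving up.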
h
Would `FallibleExecuteProcessResult` still cache the failure? That’s the main issue we need to work around. Almost home and can check
a
if it doesn't already, i would absolutely think we would want to avoid caching any failed process execution
i can also check
h
We do cache failed `ExecuteProcessResult`. Were we to stop doing this, then we’d be safe to have flakes because we could restart the CI shard and get different results. The issue now is that restarting the shard doesn’t change anything due to the cache
a
yes
but that's so easy
oh
you also mean in the in-memory node memoization
additionally, in the persistent process execution cache, we persist failed executions
which might be more correct, but it's not the behavior i think i personally want
i think
ok, well, for a console rule or something that is intended to execute a process, we can add something extra to invalidate the process request upon failure
h
> additionally, in the persistent process execution cache, we persist failed executions
Yes, this is the cause of the major problem breaking CI. The solutions are to either stop caching failures or to implement a formal retry mechanism as described in this thread’s first post
a
i don't think that's necessarily all possible solutions
one thing you could do might be:
```python
@rule(A, [...])
def f(...):
  def make_uncached_exe_req(index):
    # Varying an otherwise-unused env var gives each attempt a distinct cache key.
    return ExecuteProcessRequest(
      argv=(...),
      env={'_CACHE_KEY_ENV_VAR': str(index)},
      ...)
  for i in range(0, 10):
    res = yield Get(FallibleExecuteProcessResult, ExecuteProcessRequest, make_uncached_exe_req(i))
    # do something with `res`...
    if res.exit_code == 0:
      # stop retrying on the first success
      break
```
i believe that should avoid having failed executions being cached
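For context on why the sketch above sidesteps the cache: the whole request, including its `env`, feeds into the process execution cache key, so giving each attempt a distinct `_CACHE_KEY_ENV_VAR` value means a previously cached failure is never reused for a later attempt (at the cost of also not reusing a cached success across different attempt indices).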