Is this how we’ll implement retries on the engine ...
# development
h
Is this how we’ll implement retries on the engine side? Add a new attribute to `ExecuteProcessRequest` called `max_number_attempts`. When the `ExecuteProcessResult` has an exit code != 0, have the engine retry up to n times? Would we surface any mechanism for retry predicates, i.e. only retry if condition x is met? Or could we keep it simple and retry whenever the return code != 0?
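A rough sketch of what the proposed attribute could look like on the request datatype. This is illustrative only: the fields shown are a simplified stand-in for the real `ExecuteProcessRequest`, and `max_number_attempts` does not exist today.

```python
# Hypothetical sketch of the proposal above; not real pants code.
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExecuteProcessRequest:
  # Simplified subset of the real request fields, for illustration only.
  argv: Tuple[str, ...]
  env: Dict[str, str]
  description: str
  # Proposed new field: the engine would rerun the process up to this many times
  # whenever the exit code is != 0, returning the first success (or the last
  # failure once attempts are exhausted).
  max_number_attempts: int = 1
```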
a
so for python run instant reload we obv can just use `--loop` since we just need pants to manage restarting it, and we're not using that as part of another rule
h
(Context of this is CI pytest runner)
a
i'm not sure what it would be for jupyter notebooks since i'm not sure (as in, it could be true or false) whether we could service everything jupyter needs from pants performantly as a `@console_rule` invocation
oh well we can already do retries like you describe by just requesting `FallibleExecuteProcessResult` instead
but if a persistent connection is required, i would probably do what i'm doing with my bloop compile work right now and just have my rule loop over some input stream
we can absolutely implement retries on the engine side though -- i'm thinking that's something we could do with another type of `CommandRunner` (cc @early-needle-54791), although i don't know if that's the right level of abstraction to do it at
would depend on whether we care about things like the output from each failed attempt, or whether we just want to send back the result of the first successful request
i think "flaky process execution" or "flaky url fetch" sounds like something we might want to be able to represent in the engine in a generic way (e.g., off the top of my head, adding an intrinsic which accepts a dataclass containing a process execution request + "flakiness metadata"), which might then be applicable to the ci pytest runner
i think we can do all of that in python v2 rules though
and if we can, it's better to do that
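One possible shape for the "flakiness metadata" idea floated above, purely as a sketch; none of these names exist in the engine, and the wrapper type is hypothetical.

```python
# Illustrative only: a wrapper pairing a process execution request with a retry
# policy, roughly matching the "process execution request + flakiness metadata"
# idea above. Not part of the pants engine.
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FlakinessMetadata:
  max_attempts: int = 3
  # Exit codes worth retrying; an empty set could mean "retry any nonzero exit".
  retryable_exit_codes: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class FlakyExecuteProcessRequest:
  request: "ExecuteProcessRequest"  # the underlying process execution request
  flakiness: FlakinessMetadata = FlakinessMetadata()
```

An intrinsic (or a plain `@rule`) could then accept a `FlakyExecuteProcessRequest` and decide how many times to re-run the underlying request before giving up.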
h
Would `FallibleExecuteProcessResult` still cache the failure? That’s the main issue we need to work around. Almost home and can check
a
if it doesn't already, i would absolutely think we would want to avoid caching any failed process execution
i can also check
h
We do cache failed `ExecuteProcessResult`. Were we to stop doing this, then we’d be safe to have flakes because we could restart the CI shard and get different results. The issue now is that restarting the shard doesn’t change anything due to the cache
a
yes
but that's so easy
oh
you also mean in the in-memory node memoization
additionally, in the persistent process execution cache, we persist failed executions
which might be more correct, but it's not the behavior i think i personally want
i think
ok, well, for a console rule or something that is intended to execute a process, we can add something extra to invalidate the process request upon failure
h
> additionally, in the persistent process execution cache, we persist failed executions
Yes, this is the cause of the major problem breaking CI. The solutions are to either stop caching failures or to implement a formal retry mechanism as described in this thread’s first post
a
i don't think that's necessarily all possible solutions
one thing you could do might be:
```python
@rule(A, [...])
def f(...):
  def make_uncached_exe_req(index):
    # Varying an otherwise-unused env var gives each attempt a distinct cache key.
    return ExecuteProcessRequest(
      argv=(...),
      env={'_CACHE_KEY_ENV_VAR': str(index)},
      ...)
  for i in range(0, 10):
    res = yield Get(FallibleExecuteProcessResult, ExecuteProcessRequest, make_uncached_exe_req(i))
    # do something with `res`...
    if res.exit_code == 0:
      # stop retrying on the first success
      break
```
i believe that should avoid having failed executions being cached
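For context on why the sketch above sidesteps the cache: the whole request, including its `env`, feeds into the process execution cache key, so giving each attempt a distinct `_CACHE_KEY_ENV_VAR` value means a previously cached failure is never reused for a later attempt (at the cost of also not reusing a cached success across different attempt indices).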