# general
h
Our CI system is getting absolutely trashed with errors like this after upgrading to 2.13.0. Can I please get some assistance on this? It's breaking daily.
10:42:12.54 [INFO] Initialization options changed: reinitializing scheduler...
10:42:13.10 [INFO] Scheduler initialized.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/buildbot/.pex/unzipped_pexes/7907cee4a80399696d31dbee748702a492b91a80/__main__.py", line 102, in <module>
    from pex.pex_bootstrapper import bootstrap_pex
ModuleNotFoundError: No module named 'pex'
This is from running a simple command like `./pants run <some cli tool> -- <its necessary arguments>`. I don't see anything in the docs about the `.pex` folder, so I'm hesitant to just start throwing things like `--no-local-cache` at it.
h
Looking
Oh, previous posts I remembered about this were from you
And this is unrelated to `export`, sounds like
And this is happening frequently but not 100% of the time?
h
It has happened with our `export` step that we run to bootstrap developer environments (and some things for CI). It only happens on CI as far as I'm aware. We got through `export` by detecting whether pants complained about the module not being found, then wiping the `.pex` directory and retrying the goal. But doing that everywhere is going to be tedious.
I could add a build step for everyone to start by wiping whatever `.pex` directory they have.
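(For reference, a minimal sketch of the detect-and-retry workaround described above; the goal, target spec, and matched error string are assumptions based on the traceback earlier in the thread, not the actual build scripts:)
```bash
# Rough sketch of the wipe-and-retry workaround; goal/target are placeholders
# and the matched error string comes from the traceback above.
log="$(mktemp)"
if ! ./pants export :: 2>&1 | tee "$log"; then
  if grep -q "No module named 'pex'" "$log"; then
    rm -rf ~/.pex      # wipe the PEX runtime cache that appears corrupted
    ./pants export ::  # retry the goal once
  fi
fi
```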
h
Are you caching and restoring the lmdb store directory in CI?
w
that `.pex` directory is not the one that Pants manages (the one that Pants manages is under `~/.cache/pants/named_caches/pex_root`)
so: what exactly is it that is crashing? what are you running?
ah… `./pants run $something`
so… is something pruning or otherwise cleaning up the `~/.pex` directory? and if you `./pants package $something` and then run the pex repeatedly, is it reproducible?
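(A rough illustration of that suggestion; the target address and `dist/` output path below are placeholders, since the real targets aren't shown in the thread:)
```bash
# Build the PEX once, then execute it repeatedly outside of Pants to see
# whether the failure reproduces on every run or only intermittently.
./pants package path/to/tool:tool
for i in $(seq 1 20); do
  dist/path.to.tool/tool.pex --help || echo "attempt $i failed"
done
```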
h
Reproducing is tough since this only happens on our builders. We do have a single container that can be running multiple jobs at once with different clones of the same repository. To mitigate this happening on our `export` goal, we manually wipe `~/.pex` and then `export` succeeds. So there's probably some buried concurrency issue where one `./pants` call is working while this `./pants run` gets invoked.
We're also seeing these workers have multiple `pantsd` processes when inspecting the machines.
My concern is that since Pants isn't intentionally managing `~/.pex` but is clearly still using it, there's no way to reliably prevent collisions if we run multiple jobs on the same machine with different clones of our repository.
w
> To mitigate this happening on our `export` goal, we manually wipe `~/.pex` and then `export` succeeds.
how are you manually wiping the directory? are you doing so in a concurrency-safe way?
partial deletions of the directory can cause this exact issue
the reason this directory is being used is that `run` is implemented as (almost) “build a PEX and then run it un-sandboxed”… when you run a PEX outside of Pants’ sandbox, it does not use Pants’ config (and shouldn’t)
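(For context: when a PEX runs outside Pants' sandbox, it falls back to the PEX runtime's own cache, which, as I understand it, defaults to `~/.pex` and can be redirected with the `PEX_ROOT` environment variable. A hedged sketch of isolating that cache per CI job when a packaged PEX is executed directly; the target address, dist path, and `$CI_JOB_ID` variable are placeholders:)
```bash
# Sketch: give each CI job its own PEX runtime cache so concurrent jobs
# don't share ~/.pex. Assumes PEX_ROOT is honored by the PEX runtime;
# the target address, dist path, and $CI_JOB_ID are placeholders.
./pants package path/to/tool:tool
PEX_ROOT="$HOME/.pex-roots/job-${CI_JOB_ID:-local}" \
  dist/path.to.tool/tool.pex --some-arg
```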
h
It's safe as long as builders are at a similar step. We have a bootstrap step in our build process that grabs that file lock and, if we get hit with this error, wipes the directory and reruns the step. However, any later step on another job can be somewhere that invokes a `./pants run` while a separate worker is bootstrapping. It's not clear when this directory becomes invalidated: just that it does.
As a workaround, and maybe this is what you meant above, would you suggest running a `package` before we `run` our steps? Or is there some kind of legacy action we can invoke? I'm worried about chasing down all the places we call `run` and finding the appropriate `dist` path to invoke.
w
a file lock will only block other processes which are aware of / obeying it: do you have one process acquiring this lock to wipe the `.pex` directory while other processes are running `./pants run`? the `./pants` / `pex` processes won’t know anything about the lock
> As a workaround, and maybe this is what you meant above, would you suggest running a `package` before we `run` our steps?
that won’t fix the problem, per se: it would just allow you to see whether it is reproducible (because the resulting PEX file fails for every attempt) or just a race condition.
h
Correct, that file lock only works if everyone is on the same bootstrap step. Otherwise, every invocation of pants has to be wrapped in a:
• grab lock
• try the thing I wanted
• if it failed, reset `.pex`
• try again
• release lock
I'm sure we can make it happen, but I wouldn't consider it reliable/easy to implement.
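(A hypothetical sketch of wrapping each Pants invocation in the grab-lock / try / reset / retry sequence described in that list, using `flock(1)`; the lock file path, goal, and target address are placeholders:)
```bash
# Wrap every Pants invocation in the lock / try / wipe / retry sequence.
pants_locked() {
  (
    flock -x 9                      # grab lock (blocks until acquired)
    if ! ./pants "$@"; then         # try the thing we wanted
      rm -rf ~/.pex                 # if it failed, reset .pex
      ./pants "$@"                  # try again
    fi
  ) 9>"$HOME/.pants-pex.lock"       # lock is released when the subshell exits
}

pants_locked run path/to/tool:tool -- --some-arg
```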
w
the wiping of `.pex` sounds problematic in general. it’s just plain hard in the presence of concurrency, so it seems like backing up and fixing `export` so that you don’t need to wipe `.pex` would be best. is there an issue filed for the `export` issue specifically?
https://github.com/pantsbuild/pants/issues/16778#issuecomment-1276294801 is probably an unrelated issue
h
I haven't made one yet. I'll get on that. Though, I'm worried that `run` is doing the same thing. This happened before, and we only recently implemented this patch to try to remedy the situation.
Had to step out for a moment, but https://github.com/pantsbuild/pants/issues/17221 is up now.