# general
h
Our CI system is getting absolutely trashed with errors like this after upgrading to 2.13.0. Can I please get some assistance on this? It's breaking daily.
10:42:12.54 [INFO] Initialization options changed: reinitializing scheduler...
10:42:13.10 [INFO] Scheduler initialized.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/buildbot/.pex/unzipped_pexes/7907cee4a80399696d31dbee748702a492b91a80/__main__.py", line 102, in <module>
    from pex.pex_bootstrapper import bootstrap_pex
ModuleNotFoundError: No module named 'pex'
This is from running a simple command like `./pants run <some cli tool> -- <its necessary arguments>`. I don't see anything in the docs about the `.pex` folder, so I'm hesitant to just start throwing things like `--no-local-cache` at it.
h
Looking
Oh, previous posts I remembered about this were from you
And this is unrelated to `export`, sounds like
And this is happening frequently but not 100% of the time?
h
It has happened with our `export` step that we run to bootstrap developer environments (and some things for CI). It only happens on CI as far as I'm aware. We got through `export` by detecting whether pants complained about the module not being found, then wiping the `.pex` directory and retrying the goal. But doing that everywhere is going to be tedious.
I could add a build step for everyone to start by wiping whatever `.pex` directory they have.
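(For reference, a minimal sketch of the detect-and-retry workaround described above; the goal, target spec, and matched error string are assumptions based on the traceback earlier in the thread, not the actual build scripts:)
```bash
# Rough sketch of the wipe-and-retry workaround; goal/target are placeholders
# and the matched error string comes from the traceback above.
log="$(mktemp)"
if ! ./pants export :: 2>&1 | tee "$log"; then
  if grep -q "No module named 'pex'" "$log"; then
    rm -rf ~/.pex      # wipe the PEX runtime cache that appears corrupted
    ./pants export ::  # retry the goal once
  fi
fi
```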
h
Are you caching and restoring the lmdb store directory in CI?
w
that `.pex` directory is not the one that Pants manages (the one that Pants manages is under `~/.cache/pants/named_caches/pex_root`)
so: what exactly is it that is crashing? what are you running?
ah… `./pants run $something`
so… is something pruning or otherwise cleaning up the `~/.pex` directory? and if you `./pants package $something` and then run the pex repeatedly, is it reproducible?
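(A rough illustration of that suggestion; the target address and `dist/` output path below are placeholders, since the real targets aren't shown in the thread:)
```bash
# Build the PEX once, then execute it repeatedly outside of Pants to see
# whether the failure reproduces on every run or only intermittently.
./pants package path/to/tool:tool
for i in $(seq 1 20); do
  dist/path.to.tool/tool.pex --help || echo "attempt $i failed"
done
```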
h
Reproducing is tough since this only happens on our builders. We do have a single container that can be running multiple jobs at once with different clones of the same repository. To mitigate this happening on our `export` goal, we manually wipe `~/.pex` and then `export` succeeds. So there's probably some buried concurrency issue where one `./pants` call is working while this `./pants run` gets invoked.
We're also seeing these workers have multiple `pantsd` processes when inspecting the machines.
My concern is that since Pants isn't intentionally managing `~/.pex` but is clearly still using it, there's no way to reliably prevent collisions if we run multiple jobs on the same machine with different clones of our repository.
w
> To mitigate this happening on our `export` goal, we manually wipe `~/.pex` and then `export` succeeds.
how are you manually wiping the directory? are you doing so in a concurrency-safe way?
partial deletions of the directory can cause this exact issue
the reason this directory is being used is that `run` is implemented as (almost) “build a PEX and then run it un-sandboxed”… when you run a PEX outside of Pants’ sandbox, it does not use Pants’ config (and shouldn’t)
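(For context: when a PEX runs outside Pants' sandbox, it falls back to the PEX runtime's own cache, which, as I understand it, defaults to `~/.pex` and can be redirected with the `PEX_ROOT` environment variable. A hedged sketch of isolating that cache per CI job when a packaged PEX is executed directly; the target address, dist path, and `$CI_JOB_ID` variable are placeholders:)
```bash
# Sketch: give each CI job its own PEX runtime cache so concurrent jobs
# don't share ~/.pex. Assumes PEX_ROOT is honored by the PEX runtime;
# the target address, dist path, and $CI_JOB_ID are placeholders.
./pants package path/to/tool:tool
PEX_ROOT="$HOME/.pex-roots/job-${CI_JOB_ID:-local}" \
  dist/path.to.tool/tool.pex --some-arg
```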
h
It's safe as long as builders are at a similar step. We have a bootstrap step in our build process that grabs that file lock and, if we get hit with this error, wipes the directory and reruns the step. However, any later step on another job can be somewhere that invokes a `./pants run` while a separate worker is bootstrapping. It's not clear when this directory becomes invalidated: just that it does.
As a workaround, and maybe this is what you meant above, would you suggest running a `package` before we `run` our steps? Or is there some kind of legacy action we can invoke? I'm worried about chasing down all the places we call `run` and finding the appropriate `dist` path to invoke.
w
a file lock will only block other processes which are aware of / obeying it: do you have one process acquiring this lock to wipe the `.pex` directory while other processes are running `./pants run`? the `./pants` / `pex` processes won’t know anything about the lock
> As a workaround, and maybe this is what you meant above, would you suggest running a `package` before we `run` our steps?
that won’t fix the problem, per se: it would just allow you to see whether it is reproducible (because the resulting PEX file fails for every attempt) or just a race condition.
h
Correct, that file lock only works if everyone is on the same bootstrap step. Otherwise, every invocation of pants has to be wrapped in a:
• grab lock
• try the thing I wanted
• if it failed, reset `.pex`
• try again
• release lock
I'm sure we can make it happen, but I wouldn't consider it reliable/easy to implement.
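(A hypothetical sketch of wrapping each Pants invocation in the grab-lock / try / reset / retry sequence described in that list, using `flock(1)`; the lock file path, goal, and target address are placeholders:)
```bash
# Wrap every Pants invocation in the lock / try / wipe / retry sequence.
pants_locked() {
  (
    flock -x 9                      # grab lock (blocks until acquired)
    if ! ./pants "$@"; then         # try the thing we wanted
      rm -rf ~/.pex                 # if it failed, reset .pex
      ./pants "$@"                  # try again
    fi
  ) 9>"$HOME/.pants-pex.lock"       # lock is released when the subshell exits
}

pants_locked run path/to/tool:tool -- --some-arg
```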
w
the wiping of `.pex` sounds problematic in general. it’s just plain hard in the presence of concurrency, so it seems like backing up and fixing `export` so that you don’t need to wipe `.pex` would be best. is there an issue filed for the `export` issue specifically?
https://github.com/pantsbuild/pants/issues/16778#issuecomment-1276294801 is probably an unrelated issue
h
I haven't made one yet. I'll get on that. Though, I'm worried that `run` is doing the same thing. This happened before, and we only recently implemented this patch to try to remedy the situation.
Had to step out for a moment, but https://github.com/pantsbuild/pants/issues/17221 is up now.