# general
s
for the past ~9 days we’ve been seeing an error running Pants only in CI:
ImportError: No module named __pants_df_parser
when it hits, it hits at the beginning of a job during dependency-parsing (we use `--changed-since`), so I don’t think it’s a resource-exhaustion issue. the jobs always succeed on a retry. at this point I’m trying to export more data from the CI jobs as artifacts to help me figure out what’s going on - does anyone have pointers? I’m thinking of running with `--no-process-cleanup` and exporting the `PANTS_LOCAL_EXECUTION_ROOT_DIR` - should I also grab named caches?
A full representative error is:
13:28:44.08 [INFO] Starting: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:56.61 [INFO] Completed: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:57.16 [INFO] [auth-plugin] Successfully loaded Toolchain token from env var 'TOOLCHAIN_AUTH_TOKEN', expiration: 2022-12-20T20:52:09+00:00.
13:29:17.25 [INFO] Starting: Building dockerfile_parser.pex from dockerfile-parser_default.lock
13:29:18.32 [INFO] Canceled: Building dockerfile_parser.pex from dockerfile-parser_default.lock
Exception caught: (pants.engine.internals.scheduler.ExecutionError)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 115, in main
    PantsLoader.main()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 111, in main
    cls.run_default_entrypoint()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 93, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 99, in run
    runner = LocalPantsRunner.create(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/local_pants_runner.py", line 159, in create
    specs = calculate_specs(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/init/specs_calculator.py", line 94, in calculate_specs
    (changed_addresses,) = session.product_request(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 577, in product_request
    return self.execute(request)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 521, in execute
    self._raise_on_error([t for _, t in throws])
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 498, in _raise_on_error
    raise ExecutionError(
Exception message: 1 Exception encountered:
Engine traceback:
  in select
  in pants.vcs.changed.find_changed_owners
  in pants.backend.project_info.dependees.find_dependees
  in pants.backend.project_info.dependees.map_addresses_to_dependees
  in pants.engine.internals.graph.resolve_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.util_rules.dependencies.inject_docker_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/process.py", line 272, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:
stderr:
Traceback (most recent call last):
  File "/tmp/pants-sandbox-kVePcL/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex", line 234, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser
w
hey @sparse-lifeguard-95737: sorry about this.
afaict, this is a PEX not having some content, OR the venv created for the PEX being corrupt in some way… capturing the actual error’d PEX is probably the next step. if it contains the file, then the error is somewhere in how the venv was created
…hm, so. i think that if you can investigate the venv in the error (`/tmp/pants-sandbox-kVePcL/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex`), it might actually lead you back to a full copy of the PEX somehow?
`/tmp/pants-sandbox-kVePcL/.cache/pex_root` will end up as a symlink to `~/.cache/pants/named_caches/pex_root`, so after a failed run, looking at `e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex` in there would tell you more.
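a quick way to confirm that mapping on a failed run, for what it’s worth (the sandbox name here is illustrative; each run gets a random suffix):

# resolve the sandbox's pex_root symlink back to the persistent named cache
readlink -f /tmp/pants-sandbox-kVePcL/.cache/pex_root
# expected output: <home>/.cache/pants/named_caches/pex_root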
sometimes this error can occur because the `venv` was built for the wrong version (and thus has the file in the wrong `site-packages`), so it’s not necessarily the case that the file doesn’t exist at all.
cc @enough-analyst-54434: ^
regarding the “last 9 days” aspect of this: are you able to narrow that down to a particular version of Pants or PEX?
…either way: 1. getting the PEX, or 2. inspecting the created venv, will rule out one thing or the other
e
“sometimes this error can occur because the `venv` was built for the wrong version (and thus has the file in the wrong `site-packages`), so it’s not necessarily the case that the file doesn’t exist at all.”
What’s this? I have never heard of such a thing.
w
e
@#$%! I would love to get someone on a Mac to test that!
l
note our failures are all on linux
e
Ok, yes. So @loud-laptop-17949 the relevant similarity would be a Python patch version upgrade. Say 3.8.11 -> 3.8.12
Anything like that in the mentioned time period?
w
could confirm the issue by inspecting `site-packages` in the venv, so worth https://pantsbuild.slack.com/archives/C046T6T9U/p1658515259983489?thread_ts=1658412143.380059&cid=C046T6T9U probably
e
I need to page more in here. This is in CI, it’s flaky, etc. So lots of variables.
l
@witty-crayon-22786 dan and I are brainstorming how to get more inspection. Being CI with ephemeral environments, just getting the ability to inspect is hard
s
the version of Python has been constant for months (well before we started seeing this error)
looking at history of python versions…
w
capturing the entire `~/.cache/pants/named_caches/pex_root/venvs` directory for failed runs might work if it’s not too large…?
…or even just `find ~/.cache/pants/named_caches/pex_root/venvs > files.txt` perhaps?
s
yeah that’s my next idea. tried it yesterday but ended up bombing out workflows that failed for legit reasons 😬 currently working on a version that detects this specific error and only then does the upload
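one possible shape for that detect-then-upload step, as a sketch (the error string is from the traces above; the log file name, archived paths, and upload mechanism are assumptions about the workflow):

#!/usr/bin/env bash
# hypothetical CI wrapper: run pants, and only when the specific ImportError
# appears, archive the suspect cache dir for later inspection
set -uo pipefail
./pants "$@" 2>&1 | tee pants.log
status=${PIPESTATUS[0]}
if [ "$status" -ne 0 ] && grep -q 'No module named __pants_df_parser' pants.log; then
  tar czf pants-debug.tgz ~/.cache/pants/named_caches/pex_root/venvs
  # hand pants-debug.tgz to the CI artifact-upload step here
fi
exit "$status"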
w
can then look to see if the module exists (and in the right `site-packages`) for the relevant venv
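e.g., something along these lines, with the venv hash dirs taken from the failure above (paths are illustrative):

# does the parser module exist anywhere under the failing venv?
find ~/.cache/pants/named_caches/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3 \
  -name '__pants_df_parser.py'
# and which site-packages did the venv end up with?
ls ~/.cache/pants/named_caches/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/lib/python3.8/site-packages/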
thanks a lot for the help in getting to the bottom of this.
s
will set up the `find` right away, then keep rolling on the conditional upload
w
mm, just edited my pastes above: it’s `~/.cache/pants/*named_caches*/pex_root/venvs`
s
closing the loop on affected pants/pex versions:
• It started on `PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba`
• I saw there was a concurrency bug in pex, so I upgraded to `v2.1.99`. The error kept happening
• This morning I upgraded us to Pants `v2.13.0rc0`. The error is still occurring
The commit where the error started didn’t change any pants configs, but it did bump us up to 16 parallel shards for `./pants test`. not sure if that could be related
w
are you self-hosted? i.e., could multiple shards be using the same `$HOME`?
s
we are self-hosted, but each job runs in its own k8s pod. no shared disk between them (unless something very bad is happening with ephemeral storage on the nodes)
e
I notice `/tmp/pants_setup/bootstrap-Linux-x86_64/` which means the pants cache dir is `/tmp` - is that true?
s
yes. we’re running with:
PANTS_LOCAL_EXECUTION_ROOT_DIR: /tmp
PANTS_SETUP_CACHE: /tmp/pants_setup
PANTS_NAMED_CACHES_DIR: /tmp/pants_named_caches
PANTS_LOCAL_STORE_DIR: /tmp/pants_local_store
PEX_ROOT: /tmp/pex_root
because in a past iteration of our GHA setup, `$HOME` and `/tmp` were on different file systems, and there was some other concurrency bug caused by non-atomic filesystem operations inside `pip`
e
That particular pip issue was fixed, but OK. Thanks. No /tmp cleaner running I'd guess?
s
nope. and the error is always `No module named __pants_df_parser` (if there was a cleaner running I think we’d see various failures from different files going missing)
e
Ok. And, finally for the moment: the storage /tmp resides on is normal local disk, with no NFS-like thing involved?
Pex uses file locks extensively to deal with concurrency. NFS can foil that, depending on the setup.
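the primitive in question is an advisory lock; a minimal illustration of the pattern, assuming flock(1) from util-linux (the lock path is hypothetical):

# take an exclusive advisory lock around a critical section - the same
# primitive Pex relies on; NFS may not honor these locks reliably
(
  flock -x 9                 # block until we hold an exclusive lock on fd 9
  # ... build into a work dir, then rename it into place ...
) 9>/tmp/pex_root/.lock      # hypothetical lock file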
s
correct, no NFS etc
the k8s controller we use to run jobs is apparently using an `emptyDir` mount for the location where code is checked out, but all of Pants’ cache and work dirs are in the running container’s file system - is that possibly the problem?
(we’ve been running with this setup since well before the errors started popping up, but we were possibly not running with enough scale to hit it)
e
I'd think this is not the problem if that's been constant, modulo scale.
Stu's suggestion to grab venv contents is the right path here.
s
ok will try to get that set up by my EOD so we have something to look at on Monday
e
Oh, wait a sec. So there are multiple kernels accessing the same underlying storage? I.e.: multiple containers, same mounted /tmp storage? If so I need to read up / think. The file-locking only works for 1 kernel IIUC.
s
no, /tmp isn’t mounted; it’s all isolated within the container
e
Ok.
s
ha, finally some good luck! the PR to add the `find` list hit the error
w
which venv errored?
s
error is coming from `/tmp/pants-sandbox-RzFl42/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex`
de-symlinked, I think that’s `/tmp/pants_named_caches/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/`
w
hm, so… afaict, `__pants_df_parser.py` is just not there, regardless of `site-packages`. that unfortunately isn’t enough to tell us whether this is “the PEX didn’t contain the file” or “the venv creation was corrupted”
and means that the next step is probably to get the PEX… =/
s
will that be stashed elsewhere in the `named_caches` dir?
w
@enough-analyst-54434: is there some other cache directory that they could snag?
e
The site-packages contents are symlinked to `pex_root/installed_wheels` for 3rdparty, but this is a 1st party source which is copied directly into the venv site-packages dir. So ... nope. Having the original PEX directory - these are all packed layout internal PEXes IIRC - will be the next step.
s
@enough-analyst-54434 could you explain a little more what you mean by “the original PEX directory”? would it be coming out of the pants setup cache?
e
No, that will be in the LMDB store and materialized into the relevant `--no-process-cleanup` process execution sandbox tmp dir.
Reviewing above real quick if that has been suggested. Looks like maybe not. But you’d need to run Pants with `PANTS_PROCESS_CLEANUP=False` and then grab a bunch of /tmp dirs it prints out and also the Pants logs to know which one is relevant to look at. That’s pretty unwieldy though!
s
when the error hits, it’s always at the very beginning of a job, so `/tmp` shouldn’t be too full in that case (we’d just need to be smart about detecting when we hit this error vs. other failures). running with cleanup disabled should probably be fine? I guess we’ll find out if we start blowing up disk space on our nodes
following up on this - we haven’t seen the error since I added `--no-process-cleanup` and the steps to archive `/tmp` on error (of course). wild speculation on my part, but could the bug be some race condition like:
1. `__pants_df_parser.py` is getting extracted in a sandbox somewhere
2. In another sandbox, a symlink is generated pointing at ^^^
3. The sandbox from step 1 gets cleaned up before the sandbox from step 2 is done using the file
l
given that we throw away these envs when done, running with `--no-process-cleanup` seems like a bearable long-term approach
yeah?
s
yeah the only negative impact we’ve seen from leaving `--no-process-cleanup` on is that the CI logs get flooded with messages about not cleaning up the sandbox dirs
I opened a PR to disable `--no-process-cleanup` because I saw it’s been chewing through 10s of GB of ephemeral storage per job - the PR hit the `ImportError` 😂 so I guess it’s still hanging around
l
also confirmation that the cleanup process is suspect?
s
yeah
w
John is out until Saturday, but:
i can’t think of any reason why cleanup would affect this, other than that the extra IO involved in cleaning up might make a race more likely? the relevant file won’t actually exist inside of the sandbox that is being cleaned up: rather, inside of the `~/.cache/pants/etc/etc` directory
(although clearly it does have an impact for you!)
so… i’ll see if i can get a change out tomorrow morning to convert `--process-cleanup` into a ternary of `none`, `on_success`, `always`
s
thank you! I’m so curious to know what’s happening 👀
w
mutable caches man… blugh.
will try to get it cherry-picked to `2.13.x` tonight
Landed on 2.13.x as `30af7f9c0b3473bb857709f5157cfa36a2c98ced`. Should have wheels in about an hour.
e
I just re-read all the above in a bit more detail. I will point out, the find listing above is the Pex dogfood venv for its vendored Pip and not the venv for `__df_parser`.
As such, that venv is definitely expected to not have `__df_parser.py`.
s
the error still hasn’t recurred since switching to `--keep-sandboxes=on_failure` - however, switching did drop our worst-case ephemeral storage usage from ~56gb to ~9gb 🎉
we can probably keep running with `--keep-sandboxes=on_failure` forever, though it’s annoying to know this bug is floating around out there
e
s/annoying/...something else.../ - letting bugs like this go is never good to do. I've been poking at this today using the example-docker repo upgraded to Pants 2.13.0rc0 to try to get to the bottom of things.
Ah, I did not click through to the full listing you provided. I just did that and see there are multiple venvs there, and the relevant one (with `.../dockerfile-3.2.0.dist-info/`) has no `__df_parser.py` 😕
Alright, the only likely contenders explaining specifically missing user source code, but not missing dependency code - which is what you have here - are:
+ https://github.com/pantsbuild/pex/pull/1240
+ https://github.com/pantsbuild/pex/pull/1439
The latter was released in Pex 2.1.48 on September 8th, 2021. The use of a Pex with `__pants_df_parser.py` as the user code was 1st introduced in Pants 2.8.0.dev3, which used Pex 2.1.50; so those `__pants_df_parser.py` internal venv PEXes used by Pants should never have been subject to those two relevant fixed bugs. I felt like I was getting close there, but things just don't align. You weren't perchance using a custom version of Pex back on Pants 2.8? I'm still handwaving a bit here, but this is the only solid lead I've found inspecting current / Pants 2.13.x + related Pex code and issue history.
s
nope - this error only started appearing once we started using `PANTS_SHA`s on the 2.13.x branch
the only time we’ve customized the `[pex-cli].version` was to upgrade to v2.1.99 before it had been upgraded on the 2.13.x branch
we hit the `ImportError` today using `--keep-sandboxes=on_failure`, but there was a bug with my debugging logic to archive the sandboxes 🤦‍♂️
e
Ok. Well, next time it comes around I'll be keen to investigate what you capture. Do you cache any part of `named_caches` or `lmdb_store` in CI?
s
only via the toolchain remote cache. `PANTS_SETUP_CACHE` is the only thing we cache via GitHub Actions
e
Ok. Thanks.
s
@enough-analyst-54434 caught one! this will extract to a dir named `tmp`, with subdirs `pants_execution`, `pants_local_store`, `pants_named_caches`, `pants_setup`
traceback was:
Traceback (most recent call last):
  File "/tmp/pants_execution/pants-sandbox-HGLVkg/.cache/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/pex", line 235, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/python/3.8.13/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/opt/python/3.8.13/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser
looks like symlinks and `#!`s are all set assuming the dir tree is at `/tmp`
e
Ok, poking at this data now. Can you detail how “looks like symlinks and `#!`s are all set assuming the dir tree is at `/tmp`” is wrong? What is the actual layout?
For example, I see:
$ head -1 pants_named_caches/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/pex 
#!/tmp/pants_named_caches/pex_root/venvs/s/b40772c9/venv/bin/python3.8 -sE
If that is wrong, what would you expect it to be instead?
s
the layout in the tarball is correct, but it’s not rooted at `/` (unless you put it there)
extracting the archive gives you a floating `tmp` dir - if you `mv tmp/* /tmp/` then I think everything will line up
e
Ah yes, I know that. So you were just giving me helpful advice and not pointing out Pants / Pex were getting things wrong in production?
s
yup 🙂
e
Ok.
Yeah, so - I need to poke more - but this just seems to scream the mystery but not shed light. The Pants sandbox PEX has the df parser file but the pex_root venv does not. If you run the Pants sandbox PEX it creates a pex_root venv that does include df parser:
$ PEX_ROOT=/tmp/check ./pants_execution/pants-sandbox-HGLVkg/dockerfile_parser.pex/__main__.py 
Not enough arguments.
Usage: /tmp/check/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/ddab8011daaee380698ac2fb9701af18c90c03f6/lib/python3.8/site-packages/__pants_df_parser.py [DOCKERFILE ...]
So there still appears to be a mysterious timing bug or else a mysterious /tmp file deleter process. Clearly it must be the former.
Thanks for gathering that data though. It's good to have concrete confirmation of the shape of the mystery.
Ok, the sandbox PEX looks good of course and the test above shows that it creates the expected venv when run, but this check further confirms all the locally cached build results are good:
$ fs_util --local-store-path pants_local_store/ directories list | while read hash size; do echo "$hash $size:" && fs_util --local-store-path pants_local_store/ directory cat-proto --output-format recursive-file-list-with-digests $hash $size && echo; done
e39764651f3b4a2adebb0d42d010ed3c374baef0316053c96459adb28eef75b3 92:
a5b7ba120125477c565e5d11347a68eae3aef514e9be61c0b81460bebe85984d 35450            shellcheck-v0.8.0/LICENSE.txt
683a8b60a8a8d10b0300e41750d3a7d24d5e86902b5eb93acbf77b88a059a110 673              shellcheck-v0.8.0/README.txt
f4bce23c11c3919c1b20bcb0f206f6b44c44e26f2bc95f8aa708716095fa0651 6753016          shellcheck-v0.8.0/shellcheck

e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 0:

cc044f85e3e19cde83e7bf79699ecf58f05a250e900e07986d482ba671f77b84 92:
a5698cfa72aff5b90862ee4459cdefa29c9dd0f7a2109fc5523ef900ddf6f016 1014577          pants_plugins.pex/.bootstrap
5799d7f155abdfc236bc755e7d0f8674e273ebc561c2cff929a59625c0898914 160419           pants_plugins.pex/.deps/certifi-2022.6.15-py3-none-any.whl
e0680ddf9702cac6766d3584d9a91693e8a36f0708e331e358f231e9ac57a962 177831           pants_plugins.pex/.deps/chardet-4.0.0-py2.py3-none-any.whl
839fe609a7f39a483b1b83cf6de67a27708941dd3a2d22657c0243ce7a9d0052 40126            pants_plugins.pex/.deps/charset_normalizer-2.1.0-py3-none-any.whl
8446e7b408fbcc6d42cf7f962999ce9297658a34a4ca0f00b891347f7d41e400 61107            pants_plugins.pex/.deps/idna-3.3-py3-none-any.whl
84e0bd768439658df067dd42b48c64f12bc0740765f694c26ccc6fd5db39d77e 62430            pants_plugins.pex/.deps/requests-2.28.1-py3-none-any.whl
deb0df538a4601a46b778a84b8cb0021acc8e9f7a4a8af5ef23de1dc235f2c39 116862           pants_plugins.pex/.deps/toolchain.pants.plugin-0.20.0-py3-none-any.whl
16b671016816877bb4395d8d4766adce43d880a314f8056acca9fb10cae35267 139270           pants_plugins.pex/.deps/urllib3-1.26.11-py2.py3-none-any.whl
8da385ab913b1ea02b996bd98b7e867f1cea21bc64caa77eb1c5d2f6d91574ed 1312             pants_plugins.pex/PEX-INFO
93f4afc1d963285b389959a19e57338891935cc064975e8bfa9ea45c74127c6f 3523             pants_plugins.pex/__main__.py
3b1c69c546c9899212ed5728c52d1dce3093ffa8d635c315810b686dba11d9a5 3497             pants_plugins.pex/__pex__/__init__.py

cc8c0bae9a6f86cd83f56d5b3d6aefdbaff86a81673b9be6b42adf54522bb7c9 418:
a5698cfa72aff5b90862ee4459cdefa29c9dd0f7a2109fc5523ef900ddf6f016 1014577          .bootstrap
5799d7f155abdfc236bc755e7d0f8674e273ebc561c2cff929a59625c0898914 160419           .deps/certifi-2022.6.15-py3-none-any.whl
e0680ddf9702cac6766d3584d9a91693e8a36f0708e331e358f231e9ac57a962 177831           .deps/chardet-4.0.0-py2.py3-none-any.whl
839fe609a7f39a483b1b83cf6de67a27708941dd3a2d22657c0243ce7a9d0052 40126            .deps/charset_normalizer-2.1.0-py3-none-any.whl
8446e7b408fbcc6d42cf7f962999ce9297658a34a4ca0f00b891347f7d41e400 61107            .deps/idna-3.3-py3-none-any.whl
84e0bd768439658df067dd42b48c64f12bc0740765f694c26ccc6fd5db39d77e 62430            .deps/requests-2.28.1-py3-none-any.whl
deb0df538a4601a46b778a84b8cb0021acc8e9f7a4a8af5ef23de1dc235f2c39 116862           .deps/toolchain.pants.plugin-0.20.0-py3-none-any.whl
16b671016816877bb4395d8d4766adce43d880a314f8056acca9fb10cae35267 139270           .deps/urllib3-1.26.11-py2.py3-none-any.whl
8da385ab913b1ea02b996bd98b7e867f1cea21bc64caa77eb1c5d2f6d91574ed 1312             PEX-INFO
93f4afc1d963285b389959a19e57338891935cc064975e8bfa9ea45c74127c6f 3523             __main__.py
3b1c69c546c9899212ed5728c52d1dce3093ffa8d635c315810b686dba11d9a5 3497             __pex__/__init__.py

dff0203fc7cbcc7d0c106339c72050112236288cc6decbab8e33a1d289d5120e 86:
e7dc2e97f71d7da7418434de242128e0a168140dd1fb5916dc0999bcb0516738 3497             __init__.py

fb92dadf6f75576144d6e3d626d6ea631ddffe6a2ea8fba15ae71adda96d9581 515:
a5698cfa72aff5b90862ee4459cdefa29c9dd0f7a2109fc5523ef900ddf6f016 1014577          .bootstrap
3ae6eca84f6f012d771b2bd5e3408c01f69354d13d9e824406571aa697e7ce26 2020947          .deps/dockerfile-3.2.0-cp36-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.whl
b41f7f362d6ff5c59e550fd691271640a763e08321c4d1d91891d49d7a821684 724              PEX-INFO
f1b226184dc341bedacf21d6ae72cfd6c79a0d0461f6a02d346559d59e21b341 3523             __main__.py
c6e74285af32dda55a159465eac4a8fe0bea2da0ae9b2debb9c1ff9af0ac3ebc 6330             __pants_df_parser.py
e7dc2e97f71d7da7418434de242128e0a168140dd1fb5916dc0999bcb0516738 3497             __pex__/__init__.py

b639524d844690157bff94d33c9efa68f11096058930b20bfa25a4aaba644138 95:
a5b7ba120125477c565e5d11347a68eae3aef514e9be61c0b81460bebe85984d 35450            __extract_archive_dir/shellcheck-v0.8.0/LICENSE.txt
683a8b60a8a8d10b0300e41750d3a7d24d5e86902b5eb93acbf77b88a059a110 673              __extract_archive_dir/shellcheck-v0.8.0/README.txt
f4bce23c11c3919c1b20bcb0f206f6b44c44e26f2bc95f8aa708716095fa0651 6753016          __extract_archive_dir/shellcheck-v0.8.0/shellcheck

bcc48548627a8db05bc36a248abe9c95564684577716484f901bb320f55d9bef 96:
a5698cfa72aff5b90862ee4459cdefa29c9dd0f7a2109fc5523ef900ddf6f016 1014577          dockerfile_parser.pex/.bootstrap
3ae6eca84f6f012d771b2bd5e3408c01f69354d13d9e824406571aa697e7ce26 2020947          dockerfile_parser.pex/.deps/dockerfile-3.2.0-cp36-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.whl
b41f7f362d6ff5c59e550fd691271640a763e08321c4d1d91891d49d7a821684 724              dockerfile_parser.pex/PEX-INFO
f1b226184dc341bedacf21d6ae72cfd6c79a0d0461f6a02d346559d59e21b341 3523             dockerfile_parser.pex/__main__.py
c6e74285af32dda55a159465eac4a8fe0bea2da0ae9b2debb9c1ff9af0ac3ebc 6330             dockerfile_parser.pex/__pants_df_parser.py
e7dc2e97f71d7da7418434de242128e0a168140dd1fb5916dc0999bcb0516738 3497             dockerfile_parser.pex/__pex__/__init__.py

91b370a34149c63793acd05c9ce0281ac51391539fcf52814938e89a064e1ddc 146:
3ae6eca84f6f012d771b2bd5e3408c01f69354d13d9e824406571aa697e7ce26 2020947          dockerfile-3.2.0-cp36-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.whl

5bba33b218e573f5409ec0da7e2fdf7f7a14b7825d3bd940ceae2e508c640b1f 86:
3b1c69c546c9899212ed5728c52d1dce3093ffa8d635c315810b686dba11d9a5 3497             __init__.py

1386d00aa9225cf67fff88ffd3158bbb512c5bfcf56dd60cc5ea41947671b2cd 261:
a5b7ba120125477c565e5d11347a68eae3aef514e9be61c0b81460bebe85984d 35450            LICENSE.txt
683a8b60a8a8d10b0300e41750d3a7d24d5e86902b5eb93acbf77b88a059a110 673              README.txt
f4bce23c11c3919c1b20bcb0f206f6b44c44e26f2bc95f8aa708716095fa0651 6753016          shellcheck

180958e93d55b97853b3ee8346773880f99af6e5c7eab05be5bf6bc414f9701c 780:
5799d7f155abdfc236bc755e7d0f8674e273ebc561c2cff929a59625c0898914 160419           certifi-2022.6.15-py3-none-any.whl
e0680ddf9702cac6766d3584d9a91693e8a36f0708e331e358f231e9ac57a962 177831           chardet-4.0.0-py2.py3-none-any.whl
839fe609a7f39a483b1b83cf6de67a27708941dd3a2d22657c0243ce7a9d0052 40126            charset_normalizer-2.1.0-py3-none-any.whl
8446e7b408fbcc6d42cf7f962999ce9297658a34a4ca0f00b891347f7d41e400 61107            idna-3.3-py3-none-any.whl
84e0bd768439658df067dd42b48c64f12bc0740765f694c26ccc6fd5db39d77e 62430            requests-2.28.1-py3-none-any.whl
deb0df538a4601a46b778a84b8cb0021acc8e9f7a4a8af5ef23de1dc235f2c39 116862           toolchain.pants.plugin-0.20.0-py3-none-any.whl
16b671016816877bb4395d8d4766adce43d880a314f8056acca9fb10cae35267 139270           urllib3-1.26.11-py2.py3-none-any.whl
r
I was running into what seemed like the very same issue today with pants 2.12.0 in a self-hosted AKS github actions runner. I was suspicious of the `/tmp` directory. I added the env var `PANTS_LOCAL_EXECUTION_ROOT_DIR: /home/ci/`, which did fix the issue in my environment.
Sorry, spoke too soon - same error
l
ha @rhythmic-battery-45198 welcome to the club (“I fixed it!” “no I didn’t”)
s
realized I never actually made an issue for this: https://github.com/pantsbuild/pants/issues/16778
@enough-analyst-54434 @witty-crayon-22786 we just hit a different error in the `Parse Dockerfile` process, if it helps shed any more light:
/tmp/pants_named_caches/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/bin/python: can't open file '/tmp/pants_named_caches/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/pex': [Errno 2] No such file or directory
e
Yeah, unfortunately just as mysterious. The venv exists (bin/python exists), but the pex script does not. The venv is created in a tmp dir under a POSIX file lock for the final dest dir here: https://github.com/pantsbuild/pex/blob/bba581c1fce226267131db8b7486734f0a2b9261/pex/pex_bootstrapper.py#L487 And the `bin/python` in the tmp dir is in place once this bit is done: https://github.com/pantsbuild/pex/blob/bba581c1fce226267131db8b7486734f0a2b9261/pex/pex_bootstrapper.py#L492-L497 The `pex` script is in place in the tmp dir once this is done: https://github.com/pantsbuild/pex/blob/bba581c1fce226267131db8b7486734f0a2b9261/pex/pex_bootstrapper.py#L532-L541 And just before the atomic_directory block returns, the tmp dir is `os.rename`'d to the dest dir and then the file lock released.
The only 2 things that even come close to making sense here are either a /tmp dir cleaner or POSIX file locks not working + extra craziness since the os.rename really guards the whole atomicity bit.
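in shell terms, that atomic_directory flow is roughly the following (a sketch of the pattern only, not Pex's actual code; `build_venv_into` and the hash variables are hypothetical stand-ins):

# populate a unique work dir, then rename(2) it into place, so readers only
# ever see a missing venv or a complete one - never a partial one
dest="$PEX_ROOT/venvs/$pex_hash/$interpreter_hash"   # hypothetical vars
work="$(mktemp -d "${dest}.XXXXXX")"   # unique work dir next to the destination
build_venv_into "$work"                # bin/python, site-packages, the pex script
mv -T "$work" "$dest"                  # atomic on one filesystem; fails if another
                                       # process already won the race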
s
if there was a /tmp cleaner running then I’d expect to see the error even when process cleanup is disabled, but we don’t 🤔
To be very sure I updated our workflow to stop using /tmp for the pants cache/execution dirs - still hitting the bug
e
Thanks Dan. That was a very unlikely route, but it's good to have it ruled out. The remaining possibilities are very narrow, but I still cannot make sense of them. Basically POSIX file locks failing to work properly + UUID4 generating a collision now and again. That combination seems out of the question. Then again - something is happening!
Right now the UUID4 collision would be silent. I'll make sure that fails loudly to keep the universe at bay: https://github.com/pantsbuild/pex/issues/1901
OK, so uuid4 is just exactly as collision resistant as `os.urandom(16)` on your system. It looks like - unless the Python is configured strangely - that gets `/dev/urandom` bytes: https://github.com/python/cpython/blob/ea1a54506b4ac38b712ba63ec884292025f16111/Python/bootstrap_hash.c#L266-L395
s
more data: I added some logic to our CI system to detect when we hit this error and:
1. `rm -r "${PANTS_NAMED_CACHES_DIR}/pex_root/venvs"`
2. re-run `./pants <goal>`
up to 3 times. things seem more stable since I added this code, but we still see PRs that hit the error 3/3 times 🤯
my next thought is to `rm -r "${PANTS_NAMED_CACHES_DIR}/pex_root"` in case the bad state is getting cached elsewhere in there, but the reproducibility of the problem feels like another point against the (already very unlikely) possibility of collisions from random byte generation?
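for reference, the detect-and-retry wrapper described above would look something like this (a sketch; the error string is from the traces, while the script shape, log name, and retry count are assumptions):

# hypothetical retry wrapper: on the specific ImportError, drop the cached
# venvs and re-run the pants invocation, up to 3 attempts total
for attempt in 1 2 3; do
  ./pants "$@" 2>&1 | tee pants.log
  status=${PIPESTATUS[0]}
  [ "$status" -eq 0 ] && break
  grep -q 'No module named __pants_df_parser' pants.log || break
  rm -rf "${PANTS_NAMED_CACHES_DIR}/pex_root/venvs"
done
exit "$status"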
e
Afaict you've succeeded in further confirming this is flaky - your CI became less flaky by re-trying, but not much else. I still have no clue how this is happening, and I agree that UUID4 being broken is completely unlikely.