we’re trying to switch to the new pex-based lockfi...
# general
s
we’re trying to switch to the new pex-based lockfiles in 2.11, and we’re seeing errors like this in CI runs:
Copy code
Traceback (most recent call last):
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 223, in _main
    status = self.run(options, args)
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 180, in wrapper
    return func(self, options, args)
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/commands/wheel.py", line 174, in run
    build_successes, build_failures = build(
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/wheel_builder.py", line 338, in build
    wheel_file = _build_one(
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/wheel_builder.py", line 238, in _build_one
    _verify_one(req, wheel_path)
  File "/github/home/.cache/pants/named_caches/pex_root/venvs/s/037c1085/venv/lib/python3.8/site-packages/pip/_internal/wheel_builder.py", line 193, in _verify_one
    with zipfile.ZipFile(wheel_path, allowZip64=True) as zf:
  File "/opt/python/3.8.12/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/opt/python/3.8.12/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Traceback (most recent call last):
  File "/github/home/.cache/pants/named_caches/pex_root/installed_wheels/6a5463833452712ecf58a573b1b489c81d14aaed599e955c2af85cf793217a42/pex-2.1.84-py2.py3-none-any.whl/pex/jobs.py", line 512, in execute_parallel
    yield spawn_result.spawned_job.await_result()
  File "/github/home/.cache/pants/named_caches/pex_root/installed_wheels/6a5463833452712ecf58a573b1b489c81d14aaed599e955c2af85cf793217a42/pex-2.1.84-py2.py3-none-any.whl/pex/jobs.py", line 216, in await_result
    job.wait()
  File "/github/home/.cache/pants/named_caches/pex_root/installed_wheels/6a5463833452712ecf58a573b1b489c81d14aaed599e955c2af85cf793217a42/pex-2.1.84-py2.py3-none-any.whl/pex/jobs.py", line 75, in wait
    self._check_returncode(stderr)
  File "/github/home/.cache/pants/named_caches/pex_root/installed_wheels/6a5463833452712ecf58a573b1b489c81d14aaed599e955c2af85cf793217a42/pex-2.1.84-py2.py3-none-any.whl/pex/jobs.py", line 144, in _check_returncode
    raise self.create_error(msg, stderr=stderr)
  File "/github/home/.cache/pants/named_caches/pex_root/installed_wheels/6a5463833452712ecf58a573b1b489c81d14aaed599e955c2af85cf793217a42/pex-2.1.84-py2.py3-none-any.whl/pex/jobs.py", line 125, in create_error
    raise self.Error(
pex.jobs.Error: Executing /github/home/.cache/pants/named_caches/pex_root/venvs/7bbcd6f261771e5523bda29675060150642cf629/f679563b742633f5edaba61ca7d405616ffc49b3/pex --disable-pip-version-check --no-python-version-warning --exists-action a --use-deprecated legacy-resolver --isolated -q --cache-dir /github/home/.cache/pants/named_caches/pex_root wheel --no-deps --wheel-dir /github/home/.cache/pants/named_caches/pex_root/built_wheels/sdists/CensusData-1.3.tar.gz/127dc284eb8aec79109954d93cc45f450f3a42c683f8beed215cc23610b005ba/cp38-cp38.d045f6a0ac354ac0a72574c89b2710a4 /github/home/.cache/pants/named_caches/pex_root/downloads/127dc284eb8aec79109954d93cc45f450f3a42c683f8beed215cc23610b005ba/CensusData-1.3.tar.gz --index-url <https://pypi.org/simple> --retries 5 --timeout 15 failed with 2
that error specifically mentions
CensusData
but I’ve seen it on other packages too so I don’t think it’s related to the package itself. has anyone else seen this error and know what’s going on?
cc @enough-analyst-54434 have you run into this one before? 🙂
h
s
we got past the above and are now hitting this issue
(“the above” being Michael’s issues with mismatched transitive deps)
the error I posted looks to me like either a bug in pex or some strangeness in our CI system that’s corrupting our on-disk cache
Copy code
zipfile.BadZipFile: File is not a zip file
is the line in the traceback that makes me think corrupted cache - just not sure where the corruption is coming from
h
Can you move
~/.cache/pants/named_caches/pex_root/
to the side (while preserving it) and see if that helps?
e
I think that's the same error as mine. Also bad/truncated zip files. Also only happened after I enabled lockfiles/resolves (in the same codebase).
s
yeah, and it doesn’t happen 100% of the time for me - my first run in CI hit it, then I rebased and dodged it (with a fresh cache), then I rebased again and hit it (with a fresh cache again)
h
And neither of you see this on your desktops, only in CI?
s
So far yes for me, but I haven't tried an exact repro locally yet. Will try it out later today
e
Hrm, so this is roughly Pex calling
pip wheel CensusData-1.3.tar.gz
to turn a downloaded sdist into a wheel. Its during that Pip call that Pip, after building the wheel, tries to verify the wheel and fails. I have not seen this.
s
I'll see if I can get GHA to export the cache as a build artifact so we can see what's going on
e
A pared down repro that does not repro is:
Copy code
$ python3.8 -mvenv 3.8.venv
$ 3.8.venv/bin/pip -q install -U pip==20.3.4 wheel==0.37.1 setuptools==44.0.0
$ 3.8.venv/bin/pip wheel /home/jsirois/.pex/downloads/127dc284eb8aec79109954d93cc45f450f3a42c683f8beed215cc23610b005ba/CensusData-1.3.tar.gz
$ ls -lh CensusData-1.3-py3-none-any.whl 
-rw-r--r-- 1 jsirois jsirois 21M May 24 18:31 CensusData-1.3-py3-none-any.whl
Nothing crazy there - 21M is not too big.
The
pip wheel
command does have
--no-verify                 Don't verify if built wheel is valid.
which Pex does not utilize. Presumably that would normally just save time, but in this case the verify failure does seem legit. You would not want to proceed with a truncated zip.
s
Yeah the same error has shown up multiple times, but with a different package being flagged every time. When I've spot-checked individual packages they never repro
e
It looks like Pex does have plumbing to pass
--no-verify
but its not hooked up to the Pex CLI option
--ignore-errors
which would be the most likely place. Either way, Pants does not pass this anyhow and it smells like it would just mask a real (mysterious) problem here.
@sparse-lifeguard-95737 I wish I had ideas here but nothing I can think of beyond exactly what the error message says, Somehow the zip file is getting truncated when written by Pip and yet that truncation is not causing Pip an IOError such that its only later discovered when trying to read the wheel in to verify it, I'm not sure how that can happen,
Oh wait! That is a super long path.
IIRC there is a misleading error when in fact path too long.
Yeah:
Copy code
$ echo "/github/home/.cache/pants/named_caches/pex_root/built_wheels/sdists/CensusData-1.3.tar.gz/127dc284eb8aec79109954d93cc45f450f3a42c683f8beed215cc23610b005ba/cp38-cp38.d045f6a0ac354ac0a72574c89b2710a4 /github/home/.cache/pants/named_caches/pex_root/downloads/127dc284eb8aec79109954d93cc45f450f3a42c683f8beed215cc23610b005ba/CensusData-1.3.tar.gz" | wc -c
343
343 > 256 == boom IIRC.
I'll try to dig that recollection up...
But I'll 1st see if I can repro with a similar long path...
s
🤯
If that is the problem, is there anything we can do about it? Most of the path seems to be shas with a fixed length
I guess we can override the cache root to something short
e
I think this idea is right, but the limits in my head are wrong.:
Copy code
$ grep _MAX /usr/src/linux/include/uapi/linux/limits.h 
#define NGROUPS_MAX    65536	/* supplemental group IDs are available */
#define ARG_MAX       131072	/* # bytes of args + environ for exec() */
#define LINK_MAX         127	/* # links a file may have */
#define NAME_MAX         255	/* # chars in a file name */
#define PATH_MAX        4096	/* # chars in a path name including nul */
#define XATTR_NAME_MAX   255	/* # chars in an extended attribute name */
#define XATTR_SIZE_MAX 65536	/* size of an extended attribute value (64k) */
#define XATTR_LIST_MAX 65536	/* size of extended attribute namelist (64k) */
#define RTSIG_MAX	  32
And I think that's typical. 255 max length for an individual file or directory name, 4096 for the total path length. Neither of which are you even close to in your backtrace. Sorry for the noise!
Hmm, so this is a guess: Pex gives its vendored Pip a custom
--cache-dir
in
~/.pex/
Under that dir Pip writes cache entries in
http/
and
wheels/
. It could be that Pip is not concurrent-safe and that you're managing to hit concurrent runs of Pex that both attempt to build a given sdist to a wheel at the ~same time. Pex uses an
atomic_directory
primitve for this sort of thing extensively in its own code and this was pushed by the heavy parallelism of the Pants v2 engine feretting out Pex bugs wrt file system atomicity. It never did flush out a similar issue with Pip, but you may be doing so here. I'll go read some pip code to see if this theory holds water...
Ok, Pip uses shutil.move to move the built wheel to the
wheels/
cache dir. I think that means that if
/tmp
- which is where building happens by default - is on a different filesystem from
~/.pex
(or
~/.cache/pants/named_caches/pex_root
in the Pants use of Pex case), then the move is not atomic. @sparse-lifeguard-95737 that should be something easy to check for your CI setup.
It looks like Pants puts the Pex
--tmpdir
as a relative path in the sandbox root. By default sandboxes are built by Pants in
/tmp/
whereas the Pex root is in `~/.cache/pants/named_caches/pex_root`; so this idea is still not dead.
The two global options to play with are: `--named-caches-dir`: https://www.pantsbuild.org/docs/reference-global#section-named-caches-dir `--local-execution-root-dir`: https://www.pantsbuild.org/docs/reference-global#section-local-execution-root-dir If those are made to reside on the same filesystem, I predict the problem goes away. There is still a bug for Pex to fix here - somehow - but if my hunch is right about you having 2 different filesystems for these 2 different paths, this would help prove the idea and the bug.
Yeah @sparse-lifeguard-95737 that's it. Here I set the
--tmpdir
to be in the PEX_ROOT (
~/.pex
) and everything is fine:
Copy code
$ count=0; while rm -rf $HOME/.pex/tmp && mkdir -p $HOME/.pex/tmp && ./repro.sh --tmpdir $HOME/.pex/tmp; do echo "Attempt $((count+=1))"; done
Attempt 1
Attempt 2
Attempt 3
Attempt 4
Attempt 5
Attempt 6
Attempt 7
Attempt 8
Attempt 9
Attempt 10
^C
But here it defaults to
/tmp
which is a different filesystem from
~
on my machine (tmpfs vs ext4 partition):
Copy code
$ count=0; while ./repro.sh; do echo "Attempt $((count+=1))"; done
  ERROR: Command errored out with exit status 1:
   command: /home/jsirois/.pex/venvs/s/99506507/venv/bin/python3.11 /home/jsirois/.pex/venvs/s/99506507/venv/lib/python3.11/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-bcsxoi58/overlay --no-warn-script-location --no-binary :all: --only-binary :none: -i <https://pypi.org/simple> -- 'setuptools>=40.8.0' wheel
       cwd: None
  Complete output (21 lines):
  Collecting setuptools>=40.8.0
    Using cached setuptools-62.3.2.tar.gz (2.6 MB)
    Getting requirements to build wheel: started
    Getting requirements to build wheel: finished with status 'done'
    Installing backend dependencies: started
    Installing backend dependencies: finished with status 'done'
      Preparing wheel metadata: started
      Preparing wheel metadata: finished with status 'done'
  Collecting wheel
    Using cached wheel-0.37.1.tar.gz (66 kB)
  Skipping wheel build for wheel, due to binaries being disabled for it.
  Building wheels for collected packages: setuptools
    Building wheel for setuptools (PEP 517): started
    Building wheel for setuptools (PEP 517): finished with status 'done'
    Created wheel for setuptools: filename=setuptools-62.3.2-py3-none-any.whl size=1225929 sha256=bb23f2e58e6137a8bdfc42b97cc8b77e3b0c11df8387ac7ab1c2745015e14964
    Stored in directory: /home/jsirois/.cache/pip/wheels/70/a3/cf/340de2f5eb2aa7abaf1059081a1f19f96bf7e1f4bb8644c0cb
    WARNING: Built wheel for setuptools is invalid: setuptools has an invalid wheel, could not read 'setuptools-62.3.2.dist-info/WHEEL' file: BadZipFile('Truncated file header')
  Failed to build setuptools
  ERROR: Could not build wheels for setuptools which use PEP 517 and cannot be installed directly
  WARNING: You are using pip version 20.3.4; however, version 22.1.1 is available.
  You should consider upgrading via the '/home/jsirois/.pex/venvs/s/99506507/venv/bin/python3.11 -m pip install --upgrade pip' command.
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/jsirois/.pex/venvs/s/99506507/venv/bin/python3.11 /home/jsirois/.pex/venvs/s/99506507/venv/lib/python3.11/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-bcsxoi58/overlay --no-warn-script-location --no-binary :all: --only-binary :none: -i <https://pypi.org/simple> -- 'setuptools>=40.8.0' wheel Check the logs for full command output.
pid 76513 -> /home/jsirois/.pex/venvs/97931ae45d86d091c7ecbdc5f93be710c50be86e/4e65055febaf795bcb2334d120ef464ddf6b4963/pex --disable-pip-version-check --no-python-version-warning --exists-action a --use-deprecated legacy-resolver --isolated -q --cache-dir /home/jsirois/.pex download --dest /tmp/tmpwaxtu38q/home.jsirois..pyenv.versions.3.11.0b1.bin.python3.11 --no-binary :all: psutil --index-url <https://pypi.org/simple> --retries 5 --timeout 15 exited with 1 and STDERR:
None
Where
repos.sh
is:
Copy code
#!/usr/bin/env bash

set -euo pipefail

rm -rf ~/.pex/built_wheels/ ~/.pex/installed_wheels/ ~/.pex/wheels/ psutil.pex*

pids=()
for i in $(seq 1 5); do
  python -m pex --no-wheel psutil -o psutil-$i.pex "$@" &
  pids+=($!)
done
for pid in ${pids[@]}; do
  wait ${pid}
done
So to workaround today, you'll need to align the Pants setting I mentioned above. Pex needs to force the Pip tmpdir to be on the same fs as the Pip cache, but Pants will also need to be fixed I think not to actively use a harmful
--tmpdir
. Perhaps I can just have Pex defend against that though. I'll file issues a little later and link here.
Thanks for digging this up!
h
Wow, great debugging! Thanks all!
1
s
Thank you @enough-analyst-54434!! confirmed that
~
is a bind-mounted dir in our CI system, while
/tmp
is purely inside the container’s overlay FS. will reconfigure the system to store the pants cache in
/tmp
h
I'll add some documentation about this
e
Thanks @happy-kitchen-89482. I'll try to make that short-lived advice. Filed the revolutionary https://github.com/pantsbuild/pex/issues/1776