# general
n
Hi again Pants dev team! I have been meaning to drop back in to share our experience and where we aim to go from here, now that we have been set up on Pants for about a month and have successfully started dist'ing projects from our repo. Unfortunately, for the moment I'm here for help, as our builds have recently started failing on production CI servers (they still work locally). The error traces to when Pants tries to install its requirement packages from PyPI. Each attempt results in the following error:
Copy code
18:30:01 SHA256 fingerprint of file:///var/tmp/pjenkslv/jenkins/workspace/casper-codetree-release-lib-ANY/casper/codetree/codetree/src/pantsbuild/pex verified.
18:30:01 Preparinng bootstrap with initial requirements
18:30:01 /ms/dist/python/PROJ/core/3.7.5/exec/bin/python: can't find '__main__' module in '/var/tmp/emmdev/.pex/unzipped_pexes/478cc1fa371ca40aa3e7dafee735ca438d4a243f'
The specific line is:
Copy code
"${python}" "${pex_path}" --cert="${PIP_CERT}" -r requirements.txt -c virtualenv -o virtualenv.pex --retries=1 --timeout=1 --index-url="${PIP_INDEX_URL}"
The extra arguments are our own to make downloading packages work, but this line is in the `bootstrap_virtualenv` function and it has worked fine until just recently. It's very hard to diagnose problems on the CI host, so at the moment I can't tell if this is Pants/Pex related or some problem with the host...
👀 1
e
Hrm, the 1st thing you'd want to do is inspect that host. If I run `python .` and I have `__main__.py` in the current directory, it runs. If I rename it to `__not_main__.py` I get your error:
Copy code
$ python .
Hello!
$ mv __main__.py __not_main__.py
$ python .
/home/jsirois/.pyenv/versions/3.10.2/bin/python: can't find '__main__' module in '/tmp/test/.'
$
Is it true that `/var/tmp/emmdev` is the CI user's home dir, and that it is in fact temporary and re-populated on each CI run?
Also - if you can offer up your Pants version, that helps.
And ... on missing files under tmp ... does CI have a tmp cleaner CRON job?
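If it helps, these are the usual places a tmp reaper hides on a Linux host (names vary by distro, so treat this as a starting point, not a definitive checklist):
```bash
# Cron-based reapers (tmpwatch / tmpreaper) and the systemd tmpfiles timer.
ls /etc/cron.daily /etc/cron.d 2>/dev/null | grep -iE 'tmp(watch|reaper)'
systemctl list-timers 2>/dev/null | grep -i tmpfiles   # systemd-tmpfiles-clean.timer
cat /usr/lib/tmpfiles.d/tmp.conf 2>/dev/null           # retention ages for /tmp and /var/tmp
```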
n
Version is 2.9.0. The home directory /var/tmp/emmdev is set because the system id's real home directory is not writeable (firm policy); /var/tmp is writeable and is what we're told to use when we need temporary data storage (it does get cleaned up, though I'm not sure on what schedule; we haven't worried about caching/performance yet and were happy to let Pants re-bootstrap itself on each build for now). I made some progress by explicitly deleting ~/.cache/pants and ~/.pex at the beginning of the build script (sketched at the end of this message). It now gets past the part where it fetches all of its dependencies from PyPI. But now it fails when trying to invoke pip.
Copy code
22:56:37 SHA256 fingerprint of file:///var/tmp/pjenkslv/jenkins/workspace/casper-codetree-release-lib-ANY/casper/codetree/codetree/src/pantsbuild/pex verified.
22:56:37 Preparinng bootstrap with initial requirements
22:56:37 PEX PATH: /var/tmp/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/pex-2.1.42/pex
22:56:37 /var/tmp/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/pex-2.1.42/pex
22:57:48 /var/tmp/emmdev/.pex/unzipped_pexes/478cc1fa371ca40aa3e7dafee735ca438d4a243f/.deps/pex-2.1.42-py2.py3-none-any.whl/pex/tools/commands/venv.py:141: PEXWarning: Encountered collision building venv at /var/tmp/emmdev/.pex/venvs/short/c9600bda from /var/tmp/emmdev/.pex/pip.pex/46820cb5af0dcf9295a4e7f30184cc0e9fa063dc:
22:57:48 1. /var/tmp/emmdev/.pex/venvs/720739c5b08326cc23c9ac0b68c11307ad60aca3/1fd650467e13c9fc5e0f7b7915a685aa6aec963f.02df45dac6084a0a97a4934629166fe7/lib/python3.7/site-packages/constraints.txt was provided by:
22:57:48  /var/tmp/emmdev/.pex/pip.pex/46820cb5af0dcf9295a4e7f30184cc0e9fa063dc/.deps/setuptools/constraints.txt
22:57:48  /var/tmp/emmdev/.pex/pip.pex/46820cb5af0dcf9295a4e7f30184cc0e9fa063dc/.deps/wheel/constraints.txt
22:57:48   pex_warnings.warn(message)
22:57:48 Installing pantsbuild.pants==2.9.0 into a virtual environment at /var/tmp/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37
22:57:50 created virtual environment CPython3.7.5.final.0-64 in 189ms
22:57:50   creator CPython3Posix(dest=/var/tmp/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/pants.BVOx5n/install, clear=False, no_vcs_ignore=False, global=False)
22:57:50   seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/var/tmp/emmdev/.local/share/virtualenv)
22:57:50     added seed packages: pip==21.1.2, setuptools==57.0.0, wheel==0.36.2
22:57:50   activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
22:57:50 ./pants: line 364: /var/tmp/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/pants.BVOx5n/install/bin/pip: No such file or directory
I don't know if the PEX warning is a useful clue or not, but `bootstrap_pants` gets called again a few times after each failure and then the script finally gives up.
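(For reference, the cleanup mentioned above at the top of the build script is just the following; HOME here is the CI user's /var/tmp home:)
```bash
# Wipe Pants' bootstrap/named caches and the Pex root before each CI run.
rm -rf "${HOME}/.cache/pants" "${HOME}/.pex"
```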
e
The Pex warning was fixed in Pex 2.1.44. If you re-curl your pants script you can pick up a fix for that (see: https://www.pantsbuild.org/docs/installation).
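The fetch is roughly the following, per the installation docs; double-check the linked page for the exact current command:
```bash
# Re-fetch the ./pants launcher script so it references a newer Pex version.
curl -L -O https://static.pantsbuild.org/setup/pants && chmod +x ./pants
```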
As far as the new error goes, that looks like it's more of the same. Instead of `__main__.py` not being there, `bin/pip` is not there. Both these files should be there / we've never seen this to my knowledge. That's why I pushed on the tmp cleaner angle. If a tmp cleaner CRON job happens by in the middle of a CI run and your HOME is in /tmp, you're in for this sort of trouble. It would be good to hard rule a tmp cleaner in or out here before proceeding further.
@nice-florist-55958 hopefully that's some info you can get from Ops or somesuch if you don't have access to the machines?
@nice-florist-55958 I forgot about this old gem: https://github.com/pantsbuild/pex/issues/958#issuecomment-756293657 That was an issue where `__main__.py` was not found, caused by a PEX zip >2GB in size. Does that pertain to your PEXes? I don't suspect this is it though, since you say things work locally, just not in CI - the 2GB issue should be universal.
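If you want to rule size in or out quickly, something along these lines should do it (paths are illustrative, based on the cache locations in your logs):
```bash
# Illustrative: look for PEX zips or unzipped PEX trees approaching the 2GB mark.
find "${HOME}/.pex" "${HOME}/.cache/pants" -name '*.pex' -size +1500M -exec ls -lh {} + 2>/dev/null
du -sh "${HOME}/.pex/unzipped_pexes/"* 2>/dev/null | sort -h | tail -n 5
```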
Ah, sorry - that's not it. If it were, I'm pretty sure `/var/tmp/emmdev/.pex/unzipped_pexes/478cc1fa371ca40aa3e7dafee735ca438d4a243f` could never have been created from the original too-big PEX file.
n
Indeed, it's not 2GB. I had some success by nuking the tmp homedir on each CI run. In retrospect, I wonder if the private PR job running simultaneously with the release job was causing problems (are there any references on best practices for using a tool like Pants with concurrent CI jobs, possibly pulling code from different branches and clashing over the cache?). But before I resorted to that, I created an NFS compile share for the system id running Pants. It seemed to be working, as Pants successfully bootstrapped itself and I could see its named cache folders. Running `./pants version` returned the expected result too. BUT! When it came time for `./pants package ::`, bad things happened. I added some `echo` and `ls` statements in the pants script just before the pex execution line to debug. It seems Pants can in fact see the file it's complaining about before the exec line, but not after (maybe in some subprocess?). You can see below that the `ls -la` shows emmdev created that file and that it has execute permissions, but the job ultimately fails for not seeing this file, as reported in the ProcessExecutionFailure error. I don't know what the significance of a fully resolved NFS file location is (with the server and all) versus the alias, if any, but the latter is the one it attempts to look up.
Copy code
19:40:07 /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37
19:40:07 /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37/bin/python
19:40:07 /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37/bin/pants
19:40:07 /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37/bin/python /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37/bin/pants --pants-bin-name=./pants --pants-version=2.9.0 package ./proj/libs/clients/**
19:40:07 lrwxrwxrwx 1 emmdev ir_share 11 Feb 23 19:01 /a/stor118ncs2.new-york.ms.com/sc25317/s122015/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex -> __main__.py
19:40:07 lrwxrwxrwx 1 emmdev ir_share 11 Feb 23 19:01 /v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex -> __main__.py
19:40:07 19:00:15.11 [INFO] Starting: Building build_backend.pex from setuptools_default_lockfile.txt
19:41:31 19:01:31.87 [INFO] Long running tasks:
19:41:31  76.76s  Building build_backend.pex from setuptools_default_lockfile.txt
19:41:36 19:01:36.06 [INFO] Completed: Building build_backend.pex from setuptools_default_lockfile.txt
19:41:36 19:01:36.14 [ERROR] 1 Exception encountered:
19:41:36
19:41:36  ProcessExecutionFailure: Process 'Building build_backend.pex from setuptools_default_lockfile.txt' failed with exit code 1.
19:41:36 stdout:
19:41:36
19:41:36 stderr:
19:41:36 Failed to spawn a job for DistributionTarget(interpreter=PythonInterpreter('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', PythonIdentity('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', 'cp37', 'cp37m', 'manylinux_2_17_x86_64', (3, 7, 5)))): [Errno 2] No such file or directory: '/a/stor118ncs2.new-york.ms.com/sc25317/s122015/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex': '/a/stor118ncs2.new-york.ms.com/sc25317/s122015/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex'
19:41:36
19:41:36
19:41:36
19:41:36 Use `--no-process-cleanup` to preserve process chroots for inspection.
e
I have not digested all this but, NFS. That's a huge variable. Thanks for adding that detail. All caches should be concurrency safe and this has been tested pretty well over a 2+ year period ... but definitely only in the context of local disks.
n
Indeed, seems NFS is a source of many difficulties 😞. I think I reported a problem here before after trying to run Pants locally, but pointing it to use an NFS location for its cache, and then having similar issues as now. I also recall the pantsd daemon did not function well w/ NFS because of issues with inotify there. Unfortunately, NFS is really the only permanent read/write storage solution currently offered at my firm. AFS solutions are read-only, and, even if we could somehow get our own dedicated CI server, local storage is regularly cleaned up and never guaranteed safe.
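(As an aside, the inotify limitation is easy to see on an NFS mount if anyone wants to reproduce it; this needs inotify-tools and the path is just illustrative:)
```bash
# inotify on an NFS mount generally only reports changes made by this client,
# not writes from other hosts.
inotifywait -m /v/campus/ny/cs/casper/emmdev &
touch /v/campus/ny/cs/casper/emmdev/inotify-probe   # prints an event locally
# ...now touch the same file from another host: typically nothing is printed here
```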
@enough-analyst-54434 Long shot, but does Pants have an option to prevent executing jobs in spawned subprocesses?
e
It does not. This model is really at the very core of things in Pants.
I mean, yes if you use remote execution exclusively. I think we have a configuration for remote exec only. But that really only works for Linux clients unless you run your own Mac mini server farm and use Mac remote execution worker nodes. As far as I know this would be treading new ground.
So remote execution still spawns subprocesses, just not on your workstation. The remote execution worker nodes would have the same problems if they were only allowed NFS storage.
n
Ok, let me run this again w/ a stack trace and see if our infra guys can figure out if there's anything obvious going on in NFS land.
e
Ok, but I'm leery of putting folks to work due to my ignorance. I definitely have no clue what I'm talking about re NFS. I only know we mmap heavily in the one cache (an LMDB database used by Pants code) and use hardlinks and symlinks extensively in another. In that other I leverage a cooperative file lock heavily to deal with concurrency control (this is in Pex code). I have no clue how NFS / networked file systems handle this collection of file system operations. I need to educate myself.
I haven't read much yet, but for NFS locking to work using Pex's technique (https://github.com/pantsbuild/pex/blob/1cbf72221e53d5866f3d18f6d9ad2b66cd9066b6/pex/common.py#L451-L460) it appears an NFS `lockd` must be running (apparently you can run NFS without one).
What version of the NFS protocol does your setup implement?
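A quick way to check that the lock manager pieces are actually in place (server name taken from your logs earlier in the thread):
```bash
# NLM must be registered on the server; statd/lockd handle it on the client side.
rpcinfo -p stor118ncs2.new-york.ms.com | grep -E 'nlockmgr|status'
ps -e | grep -E 'lockd|rpc.statd'
```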
n
I can't say for sure yet, but I'm fairly certain the `lockd` daemon is always running, based on experience with poorly written programs dying when someone has the CSV file open they want to update! 🙂 As for the protocol, `nfsstat -c` is showing 8mm getattr requests on NFS3 and merely 50k on NFS2 for one of the hosts I am using. Testing on newer hosts that support NFS4, virtually all the file system operations are still NFS3. And `rpcinfo -p stor118ncs2.new-york.ms.com` says this storage server only supports NFS3 and 4, so I think it's safe to say NFS3 is being used, but I will try to confirm that is really the case tomorrow, and also run `nfsstat -c` on the CI server as it might be using an older NFS protocol.
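Roughly what I plan to run to confirm the negotiated version on the mounts in question (standard Linux tools, so adjust if our hosts differ):
```bash
# The vers= field in the mount options distinguishes NFSv3 from NFSv4.x.
nfsstat -m
findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS
```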
e
Ok - great. Thank you for that information. I'm still learning and reading up, but that helps focus things down.
This was the best summary I could find from a trusted source: https://wiki.wireshark.org/Network_Lock_Manager#history The last 3 paragraphs are damning if taken as reliable, and if they turn out to be true for whatever implementations of client and server NLM lock managers and kernels you're using under the NFSv3 protocol. In particular the bit about 2 clients holding an exclusive lock - that would break Pex.
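If you want to test that damning scenario directly, a crude two-terminal check with the same style of advisory (fcntl) lock Pex takes would look something like this - run it against a file on the NFS mount, ideally from two different client hosts (paths illustrative):
```bash
# Terminal 1: take and hold an exclusive advisory lock on a file on the NFS mount.
python3 - <<'EOF'
import fcntl, time
with open("/v/campus/ny/cs/casper/emmdev/locktest", "w") as fd:
    fcntl.lockf(fd, fcntl.LOCK_EX)
    print("exclusive lock held; sleeping 60s")
    time.sleep(60)
EOF

# Terminal 2: a correct NLM setup must refuse a second exclusive lock.
python3 - <<'EOF'
import fcntl
with open("/v/campus/ny/cs/casper/emmdev/locktest", "w") as fd:
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("PROBLEM: second exclusive lock was also granted")
    except OSError:
        print("OK: second exclusive lock refused")
EOF
```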
Alright, now that I've had the time to put on the "What if I had to implement NLM myself?" hat, I can offer up a pointed question for you or your Ops folks: is your NLM setup HA? If so, this enters the realm of Paxos and other distributed lock protocols, which are notoriously difficult to implement correctly.
n
Let me see if an NFS4 configuration is possible. It seems that would be a resolution as no extra protocol is needed? I had also realized I’ve been using NFS “locally” all this time w/ Pants and never had these issues. I have a suspicion that file permissions are not being properly inherited by the subprocess, resulting in not-found and permission-denied errors. In fact, when I pointed Pants at an NFS directory I do not own but have UNIX group based permissions to, I got similar errors running locally. So I tried creating a new compile share space for the CI system id with it as the “genuine” (whatever that turns out to mean) owner of that folder location (/v/campus/ny/cs/casper/emmdev instead of …/taymarti/emmdev, where I owned the parent space but gave emmdev permissions to the sub-folder; I can’t even stat the new one from my userid now). And it WORKED! It wrote the caches as it did before, but this time the subprocesses spawned by emmdev could also see those files. Unfortunately, we now have a new error. On any subsequent run I’m getting:
Copy code
09:58:23 09:58:22.22 [ERROR] 1 Exception encountered:
09:58:23
09:58:23   Exception: Snapshot failed: Error storing Digest { hash: Fingerprint<827b11f107d720685bb9b013fafc76f126779197f63f9a831300554e714949ec>, size_bytes: 77 }: MDB_CURSOR_FULL: Internal error - cursor stack limit reached
e
Let me see if an NFS4 configuration is possible. It seems that would be a resolution as no extra protocol is needed?
Perhaps, but not a solution if the HA guess is right. For that class of problems, no matter the protocol used by clients, the multi-homed server / Byzantine Generals problem becomes the issue.
I had also realized I’ve been using NFS “locally” all this time w/ Pants and never had these issues. ... and the rest ...
Excellent! That's great information to have.
Unfortunately, we now have a new error. On any subsequent run I’m getting:
Now that's a cool one. This too may or may not be related to NFS, but it's new to all of us as far as I know and worth an issue to track!
I'll write an issue up here shortly and we can move debugging there. This is getting complex but we're getting closer and Slack is a black hole eventually.
n
Do you want me to run the package cmd with any options for the issue ticket?
e
I haven't swung around to this yet, but what I'll want to gather is all the failure scenarios and separate them 1st. There are mode bits / perms issues and the MDB_CURSOR_FULL issue. If the mode bits / perms issues are still issues, I'll want output from those commands, but you may have already provided those in this thread - I need to re-read in detail.
OK - I opened https://github.com/pantsbuild/pants/issues/14599 but that's still a hack job until I circle back around to give this full care. You can add or amend as needed right away, though. I am interested to know if the `MDB_CURSOR_FULL` error is repeatable and if it goes away after killing the `pantsd` daemon and re-running.
n
pantsd=false in the configuration. Let me try to repeat w/ stack trace/debug flags on.
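(For reference, the repro invocation I'm using; flag names per the Pants docs, so correct me if any are off:)
```bash
# Re-run with a full stack trace, debug-level logging, and keep failed sandboxes around.
./pants --print-stacktrace --no-process-cleanup -ldebug package ::
```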
I have good and bad news… I'm unable to reproduce the new error after multiple attempts. Not sure if it's a one-off or what, but I will report back if it happens again; I have all the extra verbosity enabled now, which will hopefully help. The NFS permission errors not propagating to subprocesses are still a mystery though (if that turns out to be the correct way to characterize that issue). I'll recreate that scenario with the extra verbosity next just to see if that reveals any clues. For the cache corruption or whatever it is (when Pants randomly starts failing the bootstrap step after a long period of success due to missing files, before any goals like package are run), TBD on how to reproduce reliably: when using local /var/tmp, deleting it first seems to work, but I'm not sure if the issue is caused by concurrent runs hitting NFS3 locking issues and corrupting things, or if it's just our idiosyncratic setup.
Assuming nothing breaks imminently, I guess this means we are now back up and running, and as a bonus we've moved to a permanent file storage solution so we can take better advantage of the cache.
e
For all things "propagation to subprocesses", env vars are the 1st debugging stop, since Pants actively masks the environment except for very specific env vars.
So, if Kerberos, ssh, etc. are involved, holes need to be poked for particular env vars using Pants options in pants.toml.
I'm ignorant of NFS as mentioned so I have no clue if env vars could be involved here.
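A sketch of what poking a hole might look like - the [subprocess-environment] scope is from the Pants docs, but which env vars your Kerberos setup actually needs is a guess:
```bash
# Illustrative: allow Kerberos-related env vars through to Pants-run subprocesses.
cat >> pants.toml <<'EOF'
[subprocess-environment]
env_vars = ["KRB5CCNAME", "KRB5_CONFIG"]
EOF
```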
n
Ah, it could indeed be Kerberos related. I'll look more closely later and see if we can pinpoint it. And not super helpful, but here is the full stack trace:
Copy code
14:37:15 14:37:15.04 [DEBUG] Completed: `package` goal
14:37:15 14:37:15.04 [DEBUG] computed 1 nodes in 33.880423 seconds. there are 3957 total nodes.
14:37:15 14:37:15.04 [ERROR] 1 Exception encountered:
14:37:15
14:37:15 Engine traceback:
14:37:15   in select
14:37:15   in pants.core.goals.package.package_asset
14:37:15   in pants.backend.python.goals.setup_py.package_python_dist (proj/libs/clients:clients)
14:37:15   in pants.backend.python.util_rules.dists.run_pep517_build
14:37:15   in pants.backend.python.util_rules.pex.create_venv_pex (build_backend.pex)
14:37:15   in pants.backend.python.util_rules.pex.build_pex (build_backend.pex)
14:37:15   in pants.engine.process.fallible_to_exec_result_or_raise
14:37:15 Traceback (most recent call last):
14:37:15   File "/v/campus/ny/cs/casper/taymarti/emmdev/.cache/pants/setup/bootstrap-Linux-x86_64/2.9.0_py37/lib/python3.7/site-packages/pants/engine/process.py", line 287, in fallible_to_exec_result_or_raise
14:37:15     process_cleanup=process_cleanup.val,
14:37:15 pants.engine.process.ProcessExecutionFailure: Process 'Building build_backend.pex from setuptools_default_lockfile.txt' failed with exit code 1.
14:37:15 stdout:
14:37:15
14:37:15 stderr:
14:37:15 pex: Resolving interpreters
14:37:15 pex: Resolving interpreters: 0.5ms
14:37:15 pex: Building pex
14:37:15 pex: Building pex :: Resolving distributions (setuptools_default_lockfile.txt)
14:37:15 pex: Building pex :: Resolving distributions (setuptools_default_lockfile.txt) :: Resolving requirements.
14:37:15 pex: Building pex :: Resolving distributions (setuptools_default_lockfile.txt) :: Resolving requirements. :: Resolving for:
14:37:15   DistributionTarget(interpreter=PythonInterpreter('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', PythonIdentity('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', 'cp37', 'cp37m', 'manylinux_2_17_x86_64', (3, 7, 5))))
14:37:15 pex: Spawning a maximum of 4 parallel jobs to process:
14:37:15   DistributionTarget(interpreter=PythonInterpreter('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', PythonIdentity('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', 'cp37', 'cp37m', 'manylinux_2_17_x86_64', (3, 7, 5))))
14:37:15 pex: Hashing pex
14:37:15 pex: Hashing pex: 14657.6ms
14:37:15 pex: Isolating pex
14:37:15 pex: Isolating pex: 21.4ms
14:37:15 Failed to spawn a job for DistributionTarget(interpreter=PythonInterpreter('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', PythonIdentity('/ms/dist/python/PROJ/core/3.7.5-0/.exec/@sys/bin/python3.7', 'cp37', 'cp37m', 'manylinux_2_17_x86_64', (3, 7, 5)))): [Errno 2] No such file or directory: '/a/stor118ncs2.new-york.ms.com/sc25317/s122015/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex': '/a/stor118ncs2.new-york.ms.com/sc25317/s122015/taymarti/emmdev/.cache/pants/named_caches/pex_root/venvs/1d932cd3e82057d61e57c365b177aad9b535724c/9a128dacefb3843fa45de2c0dc225c7ee1cb4d0e/pex'
14:37:15
14:37:15
14:37:15
14:37:15 Use `--no-process-cleanup` to preserve process chroots for inspection.
Actually, on closer inspection, it says package goal succeeded? Maybe that’s a premature declaration of success!
e
Completed, not succeeded. Unless I'm missing the success claim line.
We're undoubtedly less word-polished at levels finer than info!
😁 1