# general
r
Running into an interesting lockfile generation issue with the `torch` package. 🧵
✅ 1
Was seeing the following error raised after adding a `torch` dependency:
stderr:
Failed to resolve requirements from PEX environment @ /home/ci/.cache/pants/named_caches/pex_root/unzipped_pexes/a5788e2aff6bc00ba7d5959665bfbcc6e8421d1c.
Needed cp39-cp39-manylinux_2_35_x86_64 compatible dependencies for:
 1: nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux"
    Required by:
      torch 1.13.1
    But this pex had no ProjectName(raw='nvidia-cuda-runtime-cu11', normalized='nvidia-cuda-runtime-cu11') distributions.
 2: nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux"
    Required by:
      torch 1.13.1
    But this pex had no ProjectName(raw='nvidia-cudnn-cu11', normalized='nvidia-cudnn-cu11') distributions.
 3: nvidia-cublas-cu11==11.10.3.66; platform_system == "Linux"
    Required by:
      torch 1.13.1
    But this pex had no ProjectName(raw='nvidia-cublas-cu11', normalized='nvidia-cublas-cu11') distributions.
 4: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux"
    Required by:
      torch 1.13.1
    But this pex had no ProjectName(raw='nvidia-cuda-nvrtc-cu11', normalized='nvidia-cuda-nvrtc-cu11') distributions.
Checked the lockfile. As expected, those nvidia distributions were missing.
"requires_dists": [
            "opt-einsum>=3.3; extra == \"opt-einsum\"",
            "typing-extensions"
]
Did some poking around with pex and isolated the interesting behavior.
With the `--style universal` and `--target-system linux` options set, the lockfile is generated without the expected nvidia libs. This is the configuration used by pants when the error is raised.
pex3 lock create --style=universal --resolver-version pip-2020-resolver --target-system linux torch==1.13.1
If I instead use the default `--style strict` on my linux x86_64 machine, the nvidia distributions make it into the lockfile:
pex3 lock create --style=strict --resolver-version pip-2020-resolver torch==1.13.1
"requires_dists": [
            "nvidia-cublas-cu11==11.10.3.66; platform_system == \"Linux\"",
            "nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == \"Linux\"",
            "nvidia-cuda-runtime-cu11==11.7.99; platform_system == \"Linux\"",
            "nvidia-cudnn-cu11==8.5.0.96; platform_system == \"Linux\"",
            "opt-einsum>=3.3; extra == \"opt-einsum\"",
            "typing-extensions"
          ],
The root issue seems to be that different `torch` wheels list different transitive dependencies. It looks like the transitive dependencies are picked up from the first artifact listed in the lockfile, which appears to follow the same order as the files listed on PyPI: https://pypi.org/project/torch/1.13.1/#files
With universal style, https://files.pythonhosted.org/packages/86/08/41315a205bcd103a9698fa8afafbb73a234db8791c[…]b10243a7/torch-1.13.1-cp39-cp39-manylinux2014_aarch64.whl is the first artifact. This whl lacks nvidia dependencies in its `METADATA` file.
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: typing-extensions
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
With strict style, https://files.pythonhosted.org/packages/81/58/431fd405855553af1a98091848cf97741302416b01[…]09d3c422b3/torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl is the artifact. This whl does include nvidia dependencies in its `METADATA` file.
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: typing-extensions
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux"
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux"
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
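The difference between the two METADATA files can be made explicit with a few lines of Python. This is only a minimal sketch: it pastes the `Requires-Dist` lines from the two excerpts above into strings (other headers omitted) and diffs them with the standard library email parser, which reads the METADATA header format. The helper names `requires_dist` and `metadata_diff` are made up for illustration and are not part of pex or pants.

```python
from email.parser import Parser

# Requires-Dist lines copied from the two METADATA excerpts above
# (other headers omitted for brevity).
AARCH64_METADATA = """\
Requires-Python: >=3.7.0
Requires-Dist: typing-extensions
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
"""

X86_64_METADATA = """\
Requires-Python: >=3.7.0
Requires-Dist: typing-extensions
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux"
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux"
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
"""


def requires_dist(metadata_text):
    """Return the set of Requires-Dist values in a METADATA document."""
    message = Parser().parsestr(metadata_text)
    return set(message.get_all("Requires-Dist") or [])


def metadata_diff(first, second):
    """Return (only_in_first, only_in_second) Requires-Dist entries, sorted."""
    a, b = requires_dist(first), requires_dist(second)
    return sorted(a - b), sorted(b - a)


# Requirements the x86_64 wheel declares that the aarch64 wheel does not.
missing_from_aarch64 = metadata_diff(AARCH64_METADATA, X86_64_METADATA)[1]
for requirement in missing_from_aarch64:
    print(requirement)
```

Run against the real METADATA files (a wheel is a zip archive, so they can be extracted with any zip tool), the same diff should show which sibling wheels of a release disagree.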
I can probably work around this easily by adding explicit dependencies for the `torch` `python_requirement` (https://www.pantsbuild.org/docs/python-third-party-dependencies#requirements-with-undeclared-dependencies).
I am wondering if this is:
• just a misbehaving package (not sure if all wheels should list the same dependencies)
• something that I can configure pants/pex to handle today
• a case not currently handled by pants/pex
/end
e
A core assumption that makes locks feasible at all is the one you've discovered: all distributions for a given version, sdist or whl, have the same deps. Without that assumption, you'd be forced to download every wheel for a given version, and that is prohibitive in time and bandwidth. The discourse thread for this rejected PEP talks about this necessary assumption: https://peps.python.org/pep-0665/
Really, the assumption has nothing to do with making locks feasible; it has to do with making resolving in general feasible.
So, yeah, this is really torch doing something "legal" that is, however, hostile to tooling.
@rhythmic-battery-45198 hopefully you can work around as you suggested - I really have no ideas on how to approach a fix for this. It would require making Pip work in a way that it does not currently.
They use environment markers, so the state of those wheels is a bit strange. They could have uniform dependency metadata with more use of environment markers. So it's not that the maintainers reject the use of environment markers. It seems more like they are unaware of the pain they are causing or ... not sure.
r
Ok thanks for the background! That all makes sense to me and is what I was suspecting. Workaround should be pretty painless
e
Thanks for digging on that one. Definitely an interesting case. Concerning too. Torch is pretty popular. This is going to be a wider problem.
g
I ran into the same issue here and resolved it by manually overriding my lock file’s `torch::requires_dists` array to this:
"project_name": "torch",
"requires_dists": [
  "opt-einsum>=3.3; extra == \"opt-einsum\"",
  "typing-extensions",
  "nvidia-cublas-cu11==11.10.3.66; platform_system == \"Linux\"",
  "nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == \"Linux\"",
  "nvidia-cuda-runtime-cu11==11.7.99; platform_system == \"Linux\"",
  "nvidia-cudnn-cu11==8.5.0.96; platform_system == \"Linux\""
],
I’m now testing dependency overrides in my BUILD files to see if I can use those instead. Curious if you ever got this to work @rhythmic-battery-45198?
Side note: I’m seeing those dependencies exposed from PyPI’s JSON endpoint. I wonder what PyTorch is doing that exposes them there but not where pants needs them:
curl -s https://pypi.org/pypi/torch/1.13.1/json | jq -r ".info.requires_dist[]"
e
The json endpoint has the same problem I'd guess. If the METADATA is not the same in each wheel, it picks one and displays it. And they picked one you happen to like.
That might change tomorrow if some sorting you'd think unimportant changes: a new wheel gets picked as the random provider of METADATA for that version and you lose.
🤔 1
I remember a thread with Donald Stufft where he points out this problem. He maintains PyPI.
🤔 1
r
I added the nvidia packages to my requirements file and these overrides to my `python_requirements` target:
overrides={
        "torch": {
            "dependencies": [
                "#nvidia-cuda-runtime-cu11",
                "#nvidia-cudnn-cu11",
                "#nvidia-cublas-cu11",
                "#nvidia-cuda-nvrtc-cu11"
            ]
        }
    }
🙌 2
g
Perfect, this worked for me and is much better than manually editing my lock file. Thank you! It’s not actually an issue, but I’m surprised that pants doesn’t inject these dependencies into the underlying `setup.py` files for the resulting python distribution; `torch==1.13.x` is the only dependency that shows up there.
Even more painful, this list of requirements needs to be carefully maintained when upgrading to new torch versions. Looks like they are doubling the set of nvidia packages needed in the next release (https://github.com/pytorch/pytorch/pull/89944). It looks like they inject dependencies at the wheel level instead of the package level since they build CUDA, ROCm, and CPU versions of the wheels.
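One way to take some of the pain out of maintaining that list by hand is to derive the `overrides` entry from the `Requires-Dist` lines of the wheel you actually deploy. The sketch below is only an illustration under a few assumptions: the helper names `undeclared_deps` and `overrides_for` are hypothetical, the name parsing is a deliberately naive regex and not a full requirement parser, and the `#name` address form is simply copied from the `overrides` snippet above. The nvidia packages still have to be added to the requirements file, as described above.

```python
import re

# Requires-Dist values as they appear in the x86_64 torch 1.13.1 METADATA.
X86_64_REQUIRES_DIST = [
    "typing-extensions",
    'nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux"',
    'nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux"',
    'nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux"',
    'nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux"',
    "opt-einsum (>=3.3) ; extra == 'opt-einsum'",
]


def undeclared_deps(requires_dist_lines):
    """Return project names of hard (non-extra) requirements, in order."""
    names = []
    for line in requires_dist_lines:
        requirement, _, marker = line.partition(";")
        if "extra ==" in marker:
            continue  # only needed when an optional extra is requested
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
        if match:
            names.append(match.group(1))
    return names


def overrides_for(project, requires_dist_lines, already_locked=()):
    """Build an overrides-style dict for one project.

    `already_locked` holds names the lock already records as dependencies,
    so they do not need to be repeated.
    """
    deps = [
        f"#{name}"
        for name in undeclared_deps(requires_dist_lines)
        if name not in already_locked
    ]
    return {project: {"dependencies": deps}}


torch_overrides = overrides_for(
    "torch", X86_64_REQUIRES_DIST, already_locked={"typing-extensions"}
)
print(torch_overrides)
```

The resulting dict can be pasted into the `overrides=` argument of the `python_requirements` target and regenerated whenever the torch pin changes.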
a
Hi. I faced this issue today. I see that it has already been discussed. The option to manually specify all requirements seems to solve the problem, but I'm looking for a potential way to simplify it. I don't mind building and running only on linux. I saw that there are parameters to `pex lock create` that control the lockfile generation, namely `--style=strict` and `--target-system=linux`. I couldn't find a way to set this in pants though. I'm also not sure what they do exactly, and what the potential caveats of using them are. Is it documented somewhere? So for example, would running `pex lock create --style=strict` on CentOS make it unusable on other distros due to some manylinux compatibility issues? (I'm sorry if this sounds silly, but I'm not well-read in this area 😓)
e
Yeah, Pants does not allow you to pick lock style or the list of target systems. If it did, the strict style would have the problem you guessed. As to documentation, there is just CLI help for locking:
$ pex.venv/bin/pex3 lock create --help
usage: pex3 lock create [-h] [--style {strict,sources,universal}]
                        [--target-system {linux,mac,windows}]
                        [--path-mapping PATH_MAPPINGS] [-o PATH]
                        [--indent INDENT] [-r FILE or URL]
                        [--constraints FILE or URL] [--python PYTHON]
                        [--python-path PYTHON_PATH]
                        [--interpreter-constraint INTERPRETER_CONSTRAINT]
                        [--platform PLATFORMS]
                        [--complete-platform COMPLETE_PLATFORMS]
                        [--manylinux [ASSUME_MANYLINUX]]
                        [--resolve-local-platforms]
                        [--resolver-version {pip-legacy-resolver,pip-2020-resolver}]
                        [--pip-version {vendored,20.3.4-patched,22.2.2,22.3,22.3.1,23.0,23.0.1}]
                        [--allow-pip-version-fallback] [--pypi] [-f PATH/URL]
                        [-i URL] [--retries RETRIES] [--timeout SECS]
                        [--proxy PROXY] [--cert PATH] [--client-cert PATH]
                        [--cache-ttl DEPRECATED] [-H DEPRECATED] [--pre]
                        [--wheel] [--build] [--prefer-wheel] [--force-pep517]
                        [--build-isolation] [--transitive] [-j JOBS]
                        [--preserve-pip-download-log] [-v] [--emit-warnings]
                        [--pex-root PEX_ROOT] [--disable-cache]
                        [--cache-dir CACHE_DIR] [--tmpdir TMPDIR]
                        [--rcfile RC_FILE]
                        [requirements ...]

optional arguments:
  -h, --help            show this help message and exit
  --style {strict,sources,universal}
                        The style of lock to generate. The 'strict' style is the
                        default and generates a lock file that contains exactly
                        the distributions that would be used in a local PEX
                        build. If an sdist would be used, the sdist is included,
                        but if a wheel would be used, an accompanying sdist will
                        not be included. The 'sources' style includes locks
                        containing both wheels and the associated sdists when
                        available. The 'universal' style generates a universal
                        lock for all possible target interpreters and platforms,
                        although the scope can be constrained via one or more
                        --interpreter-constraint. Of the three lock styles, only
                        'strict' can give you full confidence in the lock since
                        it includes exactly the artifacts that are included in
                        the local PEX you'll build to test the lock result with
                        before checking in the lock. With the other two styles
                        you lock un-vetted artifacts in addition to the 'strict'
                        ones; so, even though you can be sure to reproducibly
                        resolve those same un-vetted artifacts in the future,
                        they're still un-vetted and could be innocently or
                        maliciously different from the 'strict' artifacts you can
                        locally vet before committing the lock to version
                        control. The effects of the differences could range from
                        failing a resolve using the lock when the un-vetted
                        artifacts have different dependencies from their sibling
                        artifacts, to your application crashing due to different
                        code in the sibling artifacts to being compromised by
                        differing code in the sibling artifacts. So, although the
                        more permissive lock styles will allow the lock to work
                        on a wider range of machines /are apparently more
                        convenient, the convenience comes with a potential price
                        and using these styles should be considered carefully.
  --target-system {linux,mac,windows}
                        The target operating systems to generate the lock for.
                        This option applies only to `--style universal` locks and
                        restricts the locked artifacts to those compatible with
                        the specified target operating systems. By default,
                        'universal' style locks include artifacts for all
                        operating systems.
g
🙃 back again with this issue for a completely different package, `open3d`. The dependencies you’ll find at `curl -s https://pypi.org/pypi/open3d/0.17.0/json | jq -r ".info.requires_dist[]"` are wildly different from the dependencies on the wheel we need, `open3d-0.17.0-cp38-cp38-manylinux_2_27_x86_64.whl`. In our case the results from `generate-lockfiles` changed in a matter of days (I believe this was triggered by a new wheel upload to PyPI), leading to a breaking change where it removed a large number of packages from a lock file.
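A regenerated lock silently losing packages is the kind of change a small check can catch before it is committed. The sketch below is an assumption-laden illustration, not a pants or pex feature: it walks two already-parsed lockfiles, collects every `project_name` value it finds (the key shown in the lock excerpts earlier in this thread), and reports names present in the old lock but absent from the new one. The function names and the toy lock dicts are made up for the example. If your lockfile carries a leading comment header, strip it before handing the text to `json.loads`.

```python
def locked_projects(node):
    """Collect every "project_name" value found anywhere in a parsed lock."""
    found = set()
    if isinstance(node, dict):
        name = node.get("project_name")
        if isinstance(name, str):
            found.add(name)
        for value in node.values():
            found |= locked_projects(value)
    elif isinstance(node, list):
        for item in node:
            found |= locked_projects(item)
    return found


def dropped_projects(old_lock, new_lock):
    """Return project names locked before but missing after, sorted."""
    return sorted(locked_projects(old_lock) - locked_projects(new_lock))


# Toy stand-ins for two generations of a lockfile (hypothetical layout;
# only the "project_name" key is taken from the excerpts in this thread).
old_lock = {
    "requirements": [
        {"project_name": "torch"},
        {"project_name": "typing-extensions"},
        {"project_name": "nvidia-cublas-cu11"},
        {"project_name": "nvidia-cudnn-cu11"},
    ]
}
new_lock = {
    "requirements": [
        {"project_name": "torch"},
        {"project_name": "typing-extensions"},
    ]
}

dropped = dropped_projects(old_lock, new_lock)
print(dropped)
```

A non-empty result after running `generate-lockfiles` would be a prompt to check whether a newly uploaded wheel changed which METADATA got picked.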
e
Not much I know how to do. What are you hoping for @gentle-painting-24549?
g
Oh nothing at all - I was able to resolve the issue using `overrides`. Just wanted to leave a note about it here in case someone runs into the same issue with open3d. The issue was super perplexing for the team - it was very useful to have known about this.
e
Gotcha.