great-river-11779
11/17/2023, 1:22 AM
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cuda-cupti-cu12==12.1.105
...
etc
When adding those to the requirements.txt by hand and to the BUILD dependencies:
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[
        ":src",
        "//:requirements#nvidia-cuda-nvrtc-cu12",
        ...
        "//:requirements#nvidia-nvtx-cu12",
        "//:requirements#triton",
    ],
    args=["generate"],
    layout="packed",
    execution_mode="venv",
    restartable=True,
)
I get issues like this:
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
Any help here would be greatly appreciated.
gorgeous-winter-99296
11/17/2023, 11:21 AM
Run pants package path/to:target and see what's in there - a pex is just a glorified zip, so zipinfo + grep is great to figure out what ended up in there. I did that with torch>=2 yesterday and at least saw everything being in the env and cuda working - but I also have a working setup for torch 1.12, so it might be picking up things from that as well.
I haven't dared do a full upgrade to torch 2 yet, so it's a bit out of my knowledge domain at the moment. Just as a dumb check, are you pulling in nvidia-nvjitlink-cu12? I'm not sure if that's meant to be included but isn't, or should be provided by the host.
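A minimal sketch of that kind of inspection, assuming a zipapp-layout pex and a hypothetical output path (with layout="packed" the output is a directory rather than a single zip, so adjust accordingly):

# Sketch: list the shared objects that actually ended up inside a built PEX.
# A zipapp-layout PEX is a zip file, so the stdlib zipfile module can read it.
# The path below is a placeholder - use whatever `pants package` prints.
import zipfile

pex_path = "dist/src.python.dh.infer.tasks/generate.pex"  # hypothetical output path

with zipfile.ZipFile(pex_path) as zf:
    for name in zf.namelist():
        if ".so" in name:
            print(name)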
gorgeous-winter-99296
11/17/2023, 11:22 AM
late-advantage-75311
11/17/2023, 11:23 AM
gorgeous-winter-99296
11/17/2023, 11:25 AM
late-advantage-75311
11/17/2023, 11:26 AM
great-river-11779
11/17/2023, 5:40 PM
great-river-11779
11/17/2023, 6:48 PM
great-river-11779
11/17/2023, 6:48 PM
OSError: libcufft.so.11: cannot open shared object file: No such file or directory
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
gorgeous-winter-99296
11/17/2023, 6:48 PM
great-river-11779
11/17/2023, 6:49 PM
great-river-11779
11/17/2023, 6:49 PM
gorgeous-winter-99296
11/17/2023, 6:52 PM
That's the torch>=2 setup I did for another test the other day without issue.
great-river-11779
11/17/2023, 6:57 PM
ubuntu@inference-1:~$ sudo find / -name 'libcublas.so.*'
/usr/lib/x86_64-linux-gnu/libcublas.so.11.7.4.6
/usr/lib/x86_64-linux-gnu/libcublas.so.11
/home/ubuntu/.pex/installed_wheels/5f88a50378f71b7b289df3fd81567f4a61f8cdf8d103726a9218b500a0f78f6b/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
/home/ubuntu/.cache/pants/named_caches/pex_root/installed_wheels/ee53ccca76a6fc08fb9701aa95b6ceb242cdaab118c3bb152af4e579af792728/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
great-river-11779
11/17/2023, 6:57 PM
great-river-11779
11/17/2023, 6:57 PM
gorgeous-winter-99296
11/17/2023, 7:13 PM
pants run cuda.py gives me quite sensible outputs. Neither of those libs it complains about in your last post exists anywhere on my machine outside Pants, and import torch works. Not sure how far it gets before it falls over for you.
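The cuda.py in question wasn't shared in the thread; a plausible minimal smoke test along those lines might look like:

# cuda.py - hypothetical smoke test; the real script isn't shown in this thread.
import torch

print(torch.__version__)          # torch imports at all
print(torch.version.cuda)         # CUDA version torch was built against
print(torch.cuda.is_available())  # True if a usable GPU + driver is found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))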
great-river-11779
11/17/2023, 7:17 PM
gorgeous-winter-99296
11/17/2023, 7:19 PM
libcuda.so.1)
great-river-11779
11/17/2023, 7:22 PM
great-river-11779
11/17/2023, 7:24 PM
pants run cuda.py
gorgeous-winter-99296
11/17/2023, 7:26 PM
I tried torch.cuda.init() and an assert on torch.cuda.is_initialized(), and that also worked on my machine. But it might be a lazy loading thing where it only happens once you use @torch.jit or some specific op, for example.
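A minimal sketch of that check, extended with a real op so any lazy loading is forced (again an assumption, not the actual test from the thread):

# Hypothetical check: force CUDA initialization and run a real kernel,
# so failures aren't hidden behind torch's lazy loading.
import torch

torch.cuda.init()
assert torch.cuda.is_initialized()

x = torch.randn(64, 64, device="cuda")
y = x @ x                    # launches an actual cuBLAS kernel
torch.cuda.synchronize()     # surfaces any asynchronous launch errors
print(y.sum().item())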
gorgeous-winter-99296
11/17/2023, 7:27 PM
great-river-11779
11/17/2023, 7:28 PM
pants run src/python/dh/infer/tasks:generate
Traceback (most recent call last):
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 174, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcufft.so.11: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/pex", line 274, in <module>
runpy.run_module(module_name, run_name="__main__", alter_sys=True)
File "<frozen runpy>", line 226, in run_module
File "<frozen runpy>", line 98, in _run_module_code
File "<frozen runpy>", line 88, in _run_code
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/main.py", line 3, in <module>
from .app import app_factory
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/app.py", line 3, in <module>
from .generate import generate_task
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/generate.py", line 11, in <module>
from ..models.mask import generate_mask
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/models/mask.py", line 1, in <module>
import torch
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 234, in <module>
_load_global_deps()
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 195, in _load_global_deps
_preload_cuda_deps(lib_folder, lib_name)
File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 161, in _preload_cuda_deps
ctypes.CDLL(lib_path)
File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
gorgeous-winter-99296
11/17/2023, 7:29 PM
pex_binary vs python_source is important. Both build a pex, but they execute differently.
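Roughly, the two ways of running the same entry point (target names below are assumptions mirroring the BUILD file above):

# BUILD (sketch): the same code run as a file target vs. as a pex_binary.
python_sources(name="src")   # pants run src/python/dh/infer/tasks/main.py runs from source

pex_binary(
    name="generate",         # pants run src/python/dh/infer/tasks:generate builds and runs the PEX
    entry_point="main.py",
    dependencies=[":src"],
    execution_mode="venv",
)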
great-river-11779
11/17/2023, 7:30 PM
great-river-11779
11/17/2023, 7:30 PM
great-river-11779
11/17/2023, 7:31 PM
pants run src/python/dh/infer/tasks/main.py
great-river-11779
11/17/2023, 7:31 PM
gorgeous-winter-99296
11/17/2023, 7:32 PM
A pex_binary target will rebuild on any source change IME.
great-river-11779
11/17/2023, 7:34 PM
great-river-11779
11/17/2023, 7:34 PM
great-river-11779
11/17/2023, 7:34 PM
gorgeous-winter-99296
11/17/2023, 7:53 PM
Our pex_binary targets use execution_mode="venv" and get shoved into containers. For all our local development we run file targets (though we primarily use cmd:{train,server} etc for teachability). So that reflects what we've seen here. I did a cursory glance at the code and it looks like the run-source code path does build a venv as well and links it to the sources, so that checks out.
gorgeous-winter-99296
11/17/2023, 7:53 PM
great-river-11779
11/17/2023, 7:57 PM
great-river-11779
11/17/2023, 7:57 PM
great-river-11779
11/17/2023, 7:58 PM
gorgeous-winter-99296
11/17/2023, 8:04 PM
gorgeous-winter-99296
11/17/2023, 8:05 PM
Packaging cuda.py as a pex_binary, then extracting and running it, reproduces the issue at least. So likely some path shenanigans going wrong...
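Roughly what that reproduction might look like (target and file names are assumptions):

# BUILD (sketch): package the smoke-test script on its own to reproduce outside pants run.
pex_binary(
    name="cuda",
    entry_point="cuda.py",
    execution_mode="venv",
)
# Then roughly: pants package path/to:cuda, run the resulting .pex directly
# (or unzip it first to inspect the bundled nvidia/*/lib/*.so files) and see
# whether the same OSError shows up.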
great-river-11779
11/17/2023, 8:24 PM
So it reproduces with pex_binary too? I mean, I'd consider that a win. Thank you so much once again!
gorgeous-winter-99296
11/18/2023, 1:26 PM
loose + venv works on my reproduction, likely because it ships everything.
Edit: Nope, after clearing out .pex it breaks again. So nevermind that. But I've found a smoking gun!
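For context, the loose + venv combination being tried there, applied to a pex_binary like the one above (illustration only, not a fix):

# Sketch: the layout/execution_mode combination tested above.
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[":src"],
    layout="loose",            # ship loose files instead of "packed"
    execution_mode="venv",
)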
gorgeous-winter-99296
11/18/2023, 1:55 PM