# general
g
Hey everyone - I'm looking for some help regarding using Pants with PyTorch and the whole GPU/CPU situation. I've been reading the Slack logs + GitHub issues from 1-2 years ago trying to figure out how things work and what I can do to get things going, and I can't for the life of me figure anything out. It seems like all the comments were from years ago too - so maybe I'm the only one who can't get things working? Any guidance/path forward would be much appreciated. In terms of errors, I'm getting missing libraries:
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cuda-cupti-cu12==12.1.105
...
etc. When I add those to requirements.txt by hand and to the BUILD dependencies:
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[
        ":src",
        "//:requirements#nvidia-cuda-nvrtc-cu12",
        ...
        "//:requirements#nvidia-nvtx-cu12",
        "//:requirements#triton",
    ],
    args=["generate"],
    layout="packed",
    execution_mode="venv",
    restartable=True,
)
I get issues like this:
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
Any help here would be greatly appreciated
g
I should probably write a guide šŸ˜„ It's hard to tell exactly what the issue is from your errors. I think the best bet is to pants package path/to:target and see what's in there - pex is just a glorified zip, so zipinfo + grep is great to figure out what ended up in there. I did that with torch>=2 yesterday and at least saw all things being in the env and cuda working - but I also have a working setup for torch 1.12, so might be picking up things from that as well. I haven't dared do a full upgrade to torch 2 yet, so it's a bit out of my knowledge domain at the moment. Just as a dumb check, are you pulling in nvidia-nvjitlink-cu12? I'm not sure if that's meant to be included but isn't, or should be provided by the host.
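If it helps, here's a minimal sketch of that kind of check in Python - it assumes a zipapp-layout pex (a packed-layout pex is a directory of zips rather than a single file), and the dist/ path is just a guess based on the target in this thread:
import zipfile

# Hypothetical output path - adjust to wherever `pants package` put your .pex.
PEX_PATH = "dist/src.python.dh.infer.tasks/generate.pex"

with zipfile.ZipFile(PEX_PATH) as pex:
    for name in pex.namelist():
        lowered = name.lower()
        # Look for the CUDA libs the errors above complain about.
        if any(needle in lowered for needle in ("nvjitlink", "cufft", "cublas")):
            print(name)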
šŸ™‚ 1
You also shouldn't have to list all those dependencies explicitly as long as you pull in torch, which pulls them in transitively. Are you maybe on a mac system and that somehow confuses it? Or locking from a mac and consuming from Linux? The torch wheels are quite malformed and don't include platform-marker guards, instead changing dependency lists depending on which platform wheel you look at.
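A quick way to see what the torch wheel you actually resolved declares (a sketch - run it in whichever environment did the locking/installing):
# Prints the nvidia/triton requirements declared by the installed torch wheel;
# mac and Linux torch wheels can declare different lists, which is what breaks
# cross-platform locking.
from importlib.metadata import requires

for req in requires("torch") or []:
    if "nvidia" in req or "triton" in req:
        print(req)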
l
Tom, I am eagerly awaiting your definitive guide on this stuff. I think it will be a joy to read.
šŸ’Æ 1
g
My life as an ML engineer
ā¤ļø 4
l
Look, you already have the cover art for your article right there.
g
So we are locking on a mac with a CPU and we want to then run it on Linux with a GPU - is it worth having a different set of locks for the pytorch parts of the project?
Okay, so I've gone and added a new resolve, updated resolves throughout the project to point to it, and generated the lockfile on the Linux box with the GPU itself, and I still run into the same issue:
OSError: libcufft.so.11: cannot open shared object file: No such file or directory
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
g
Are you able to share that linux-based lockfile?
g
yeah for sure
g
Hmm. Both of those seem to exist in the lockfile at least. Let me see if I can whip up a repro... I've left the office so I can't see what my lockfile looked like. I had a torch>=2 setup I did for another test the other day without issue.
g
Something that may be an issue:
ubuntu@inference-1:~$ sudo find / -name 'libcublas.so.*' 
/usr/lib/x86_64-linux-gnu/libcublas.so.11.7.4.6
/usr/lib/x86_64-linux-gnu/libcublas.so.11
/home/ubuntu/.pex/installed_wheels/5f88a50378f71b7b289df3fd81567f4a61f8cdf8d103726a9218b500a0f78f6b/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
/home/ubuntu/.cache/pants/named_caches/pex_root/installed_wheels/ee53ccca76a6fc08fb9701aa95b6ceb242cdaab118c3bb152af4e579af792728/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
Could it be randomly finding the system version?
Or whatever that stuff in /usr/lib is?
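One way to check which copy actually gets loaded (a sketch, Linux-only - the soname below is the system copy found above, since the nvidia wheel ships .so.12):
import ctypes

# dlopen the library the generic way, then see which file got mapped.
ctypes.CDLL("libcublas.so.11")

with open("/proc/self/maps") as maps:
    for path in sorted({line.split()[-1] for line in maps if "libcublas" in line}):
        print(path)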
g
It could, yeah. I pushed my stuff here: https://github.com/tgolsson/pants-repros/tree/main/torch-2, and pants run cuda.py gives me quite sensible outputs. Neither of those libs it complains about in your last post exists anywhere on my machine outside Pants, and import torch works. Not sure how far it gets before it falls over for you.
g
Let me try running that and see what happens
g
šŸ‘ FWIW, on my machine I have the following output: • Torch version • Nvidia-smi • Error about no nvcc • Error about libcuda.so • Info about device (which it finds using
libcuda.so.1
)
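(A rough sketch of that kind of sanity-check script - the actual cuda.py in the repo linked above may differ:)
import subprocess
import torch

print("torch:", torch.__version__)
# nvidia-smi output (or a failure if the driver isn't visible).
subprocess.run(["nvidia-smi"], check=False)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))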
g
Yeah that worked
So I managed to successfully run the repo you just gave as an example with pants run cuda.py
g
Interesting. So the question is when those libraries would get loaded, and if that happens in this test. I tried adding torch.cuda.init() and asserting on torch.cuda.is_initialized(), and that also worked on my machine. But it might be a lazy-loading thing where it only happens once you use @torch.jit or some specific op, for example.
I'd expect that to show in your callstack if it happened much later though.
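The extra check looked roughly like this - forcing CUDA init and one trivial op up front so any lazy library loading happens immediately (a sketch, not the exact code I ran):
import torch

torch.cuda.init()
assert torch.cuda.is_initialized()
# A trivial op to make sure the CUDA libraries actually get exercised.
x = torch.ones(8, device="cuda")
print((x * 2).sum().item())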
g
I mean this is the whole stack:
pants run src/python/dh/infer/tasks:generate                                                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                                                           
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 174, in _load_global_deps                           
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)                                                                                                                                                                                                           
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__                                                                                                                                                                                       
    self._handle = _dlopen(self._name, mode)                                                                                                                                                                                                                 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
OSError: libcufft.so.11: cannot open shared object file: No such file or directory                                                                                                                                                                           
                                                                                                                                                                                                                                                             
During handling of the above exception, another exception occurred:                                                                                                                                                                                          
                                                                                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                                                                           
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/pex", line 274, in <module>                                                                               
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)                                                                                                                                                                                       
  File "<frozen runpy>", line 226, in run_module                                                                                                                                                                                                             
  File "<frozen runpy>", line 98, in _run_module_code                                                                                                                                                                                                        
  File "<frozen runpy>", line 88, in _run_code                                                                                
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/main.py", line 3, in <module>
    from .app import app_factory                                                                                              
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/app.py", line 3, in <module>
    from .generate import generate_task                                                                                                                                                                                                                      
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/generate.py", line 11, in <module>
    from ..models.mask import generate_mask                                                                                                                                                                                                                  
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/models/mask.py", line 1, in <module>                                      
    import torch                                                                                                                                                                                                                                             
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 234, in <module>                                       
    _load_global_deps()                                                                                                                                                                                                                                      
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 195, in _load_global_deps        
    _preload_cuda_deps(lib_folder, lib_name)                                                                                                                                                                                                                 
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 161, in _preload_cuda_deps
    ctypes.CDLL(lib_path)                    
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
g
So now the question is if the pex_binary vs python_source distinction is important. Both build a pex, but they execute differently.
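For context on the traceback above: torch's _preload_cuda_deps walks sys.path looking for the nvidia wheels' lib directories and dlopens them before libtorch loads - roughly like this simplified paraphrase (not the real torch source); the OSError is raised from that ctypes.CDLL call when the dynamic linker can't resolve the library or one of its own dependencies:
import ctypes
import glob
import os
import sys

def preload_cuda_dep(lib_folder: str, lib_name: str) -> None:
    # Search each sys.path entry for e.g. nvidia/cufft/lib/libcufft.so.* and dlopen it.
    for entry in sys.path:
        candidates = glob.glob(os.path.join(entry, "nvidia", lib_folder, "lib", lib_name))
        if candidates:
            ctypes.CDLL(candidates[0])
            return
    raise OSError(f"{lib_name} not found on sys.path")

preload_cuda_dep("cufft", "libcufft.so.*")
preload_cuda_dep("nvjitlink", "libnvJitLink.so.*")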
g
Ok
I didn't know there was much of a difference
Okay, I'll try running it directly:
pants run src/python/dh/infer/tasks/main.py
Rather than the pex binary target
g
There shouldn't be (functionally), but they're constructed slightly differently. Others know more about the exact differences, but when running a source directly the dependencies are built separately from the sources, while a pex_binary puts sources and deps in together. This means running a source directly is much more efficient, as source-level changes don't rebuild the whole pex, whereas a pex_binary target will rebuild on any source change IME.
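In BUILD terms the two flavours look something like this (an illustrative sketch - target names are made up for the example):
# `pants run path/to/main.py` goes through the python_sources target, so third-party
# deps are built as their own pex and your sources are layered on top.
python_sources(name="src")

# `pants run path/to:generate` builds this whole pex_binary (sources + deps together),
# so it gets rebuilt whenever a source file changes.
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[":src"],
    execution_mode="venv",
)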
g
NO WAY THAT WORKED
Okay well there's something with the pex binary
fuck it we can run with this omg thank you so much!!!!
g
Nice 🄳 I have a slight idea what the issue is here, but I'll have to dig a bit before filing an issue. It matches why it works in our development: our big pexes with GPU support all build with execution_mode="venv" and get shoved into containers. In all our local development we run file targets (though we primarily use cmd:{train,server} etc. for teachability). So that reflects what we've seen here. I did a cursory glance at the code and it looks like the run-source code path does build a venv as well and links it to the sources, so that checks out.
Thanks for your patience working through this šŸ™‚
g
Thank me? Thank you! It's also less patience and more stubbornness. Please link the issue so I can try and understand. w.r.t. execution_mode=venv, I actually had that set in my pex_binary. I'll admit I don't understand how/why it's needed, but it was a copy/paste job from the docs when putting the other services into containers. We're not yet at the point where I'm able to run these in docker containers; I'm just trying to bite things off one step at a time.
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[":src"],
    resolve="infer",
    args=["generate"],
    layout="packed",
    execution_mode="venv",
    restartable=True,
)
This is what I had been just running directly in dev - and if this was a "non gpu service" I'd just dump that pex into a docker container (as per the docker docs)
But for this use case, we're running directly on the host instance with the GPU, so we figured just running source on there is a good starting point before going to production, just to get all the systems working together.
g
Interesting, so maybe the execution_mode is a red herring. I'll file an issue sometime during the weekend when I have time, and see if I can repro your issue. šŸ™‚ It's a shame the torch 2 pexes take such a long time to build.
Fwiw, packaging the cuda.py as a pex_binary, then extracting and running it, reproduces the issue at least. So likely some path shenanigans going wrong...
g
Oh nice, so you get the same issue when using a pex_binary too? I mean I'd consider that a win. Thank you so much once again!
g
Ok; I'm not sure about the exact implications overall, but loose + venv works on my reproduction, likely because it ships everything. Edit: Nope, after clearing out .pex it breaks again. So nevermind that. But I've found a smoking gun!