# general
g
Hey everyone - I'm looking for some help regarding using Pants with PyTorch and the whole GPU/CPU situation. I've been reading the Slack logs + GitHub issues from 1-2 years ago trying to figure out how things work and what I can do to get things going, and I can't for the life of me figure anything out. It seems like all the comments were from years ago too - so maybe I'm the only one who can't get things working? Any guidance/path forward would be much appreciated. In terms of errors, I'm getting missing libraries:
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cuda-cupti-cu12==12.1.105
...
etc. When I add those to requirements.txt by hand and to the BUILD dependencies:
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[
        ":src",
        "//:requirements#nvidia-cuda-nvrtc-cu12",
        ...
        "//:requirements#nvidia-nvtx-cu12",
        "//:requirements#triton",
    ],
    args=["generate"],
    layout="packed",
    execution_mode="venv",
    restartable=True,
)
I get issues like this:
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
Any help here would be greatly appreciated
g
I should probably write a guide šŸ˜„ It's hard to tell exactly what the issue is from your errors. I think the best bet is to pants package path/to:target and see what's in there - pex is just a glorified zip, so zipinfo + grep is great to figure out what ended up in there. I did that with torch>=2 yesterday and at least saw all things being in the env and cuda working - but I also have a working setup for torch 1.12, so might be picking up things from that as well. I haven't dared do a full upgrade to torch 2 yet, so it's a bit out of my knowledge domain at the moment. Just as a dumb check, are you pulling in nvidia-nvjitlink-cu12? I'm not sure if that's meant to be included but isn't, or should be provided by the host.
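If it helps, here's a minimal sketch of that kind of check in Python - it assumes a zipapp-layout pex (a packed-layout pex is a directory of zips rather than a single file), and the dist/ path is just a guess based on the target in this thread:
import zipfile

# Hypothetical output path - adjust to wherever `pants package` put your .pex.
PEX_PATH = "dist/src.python.dh.infer.tasks/generate.pex"

with zipfile.ZipFile(PEX_PATH) as pex:
    for name in pex.namelist():
        lowered = name.lower()
        # Look for the CUDA libs the errors above complain about.
        if any(needle in lowered for needle in ("nvjitlink", "cufft", "cublas")):
            print(name)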
šŸ™‚ 1
You also shouldn't have to list all those dependencies explicitly as long as you pull in torch, which pulls them in transitively. Are you maybe on a mac system and that somehow confuses it? Or locking from a mac and consuming from Linux? The torch wheels are quite malformed and don't include platform-marker guards, instead changing dependency lists depending on which platform wheel you look at.
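A quick way to see what the torch wheel you actually resolved declares (a sketch - run it in whichever environment did the locking/installing):
# Prints the nvidia/triton requirements declared by the installed torch wheel;
# mac and Linux torch wheels can declare different lists, which is what breaks
# cross-platform locking.
from importlib.metadata import requires

for req in requires("torch") or []:
    if "nvidia" in req or "triton" in req:
        print(req)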
l
Tom, I am eagerly awaiting your definitive guide on this stuff. I think it will be a joy to read.
šŸ’Æ 1
g
My life as an ML engineer
ā¤ļø 4
l
Look, you already have the cover art for your article right there.
g
So we are locking on a mac with a CPU and we want to then run it on Linux with a GPU - is it worth having a different set of locks for the pytorch parts of the project?
Okay, so I've gone and added a new resolve, updated resolves throughout the project to point to it, and generated the lockfile on the Linux box with the GPU itself, and I still run into the same issue:
OSError: libcufft.so.11: cannot open shared object file: No such file or directory
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
g
Are you able to share that linux-based lockfile?
g
yeah for sure
g
Hmm. Both of those seem to exist in the lockfile at least. Let me see if I can whip up a repro... I've left the office so I can't see what my lockfile looked like. I had a torch>=2 setup I did for another test the other day without issue.
g
Something that may be an issue:
ubuntu@inference-1:~$ sudo find / -name 'libcublas.so.*' 
/usr/lib/x86_64-linux-gnu/libcublas.so.11.7.4.6
/usr/lib/x86_64-linux-gnu/libcublas.so.11
/home/ubuntu/.pex/installed_wheels/5f88a50378f71b7b289df3fd81567f4a61f8cdf8d103726a9218b500a0f78f6b/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
/home/ubuntu/.cache/pants/named_caches/pex_root/installed_wheels/ee53ccca76a6fc08fb9701aa95b6ceb242cdaab118c3bb152af4e579af792728/nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl/nvidia/cublas/lib/libcublas.so.12
Could it be randomly finding the system version?
Or whatever that stuff in /usr/lib is?
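One way to check which copy actually gets loaded (a sketch, Linux-only - the soname below is the system copy found above, since the nvidia wheel ships .so.12):
import ctypes

# dlopen the library the generic way, then see which file got mapped.
ctypes.CDLL("libcublas.so.11")

with open("/proc/self/maps") as maps:
    for path in sorted({line.split()[-1] for line in maps if "libcublas" in line}):
        print(path)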
g
It could, yeah. I pushed my stuff here: https://github.com/tgolsson/pants-repros/tree/main/torch-2, and pants run cuda.py gives me quite sensible outputs. Neither of those libs it complains about in your last post exists anywhere on my machine outside Pants, and import torch works. Not sure how far it gets before it falls over for you.
g
Let me try running that and see what happens
g
šŸ‘ FWIW, on my machine I have the following output: • Torch version • Nvidia-smi • Error about no nvcc • Error about libcuda.so • Info about device (which it finds using
libcuda.so.1
)
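(A rough sketch of that kind of sanity-check script - the actual cuda.py in the repo linked above may differ:)
import subprocess
import torch

print("torch:", torch.__version__)
# nvidia-smi output (or a failure if the driver isn't visible).
subprocess.run(["nvidia-smi"], check=False)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))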
g
Yeah that worked
So I managed to successfully run the repo you just gave as an example with pants run cuda.py
g
Interesting. So the question is when those libraries would get loaded, and if that happens in this test. I tried adding torch.cuda.init() and asserting on torch.cuda.is_initialized(), and that also worked on my machine. But it might be a lazy-loading thing where it only happens once you use @torch.jit or some specific op, for example.
I'd expect that to show in your callstack if it happened much later though.
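The extra check looked roughly like this - forcing CUDA init and one trivial op up front so any lazy library loading happens immediately (a sketch, not the exact code I ran):
import torch

torch.cuda.init()
assert torch.cuda.is_initialized()
# A trivial op to make sure the CUDA libraries actually get exercised.
x = torch.ones(8, device="cuda")
print((x * 2).sum().item())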
g
I mean this is the whole stack:
pants run src/python/dh/infer/tasks:generate                                                                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                                                                                           
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 174, in _load_global_deps                           
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)                                                                                                                                                                                                           
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__                                                                                                                                                                                       
    self._handle = _dlopen(self._name, mode)                                                                                                                                                                                                                 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
OSError: libcufft.so.11: cannot open shared object file: No such file or directory                                                                                                                                                                           
                                                                                                                                                                                                                                                             
During handling of the above exception, another exception occurred:                                                                                                                                                                                          
                                                                                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                                                                           
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/pex", line 274, in <module>                                                                               
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)                                                                                                                                                                                       
  File "<frozen runpy>", line 226, in run_module                                                                                                                                                                                                             
  File "<frozen runpy>", line 98, in _run_module_code                                                                                                                                                                                                        
  File "<frozen runpy>", line 88, in _run_code                                                                                
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/main.py", line 3, in <module>
    from .app import app_factory                                                                                              
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/app.py", line 3, in <module>
    from .generate import generate_task                                                                                                                                                                                                                      
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/tasks/generate.py", line 11, in <module>
    from ..models.mask import generate_mask                                                                                                                                                                                                                  
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/dh/infer/models/mask.py", line 1, in <module>                                      
    import torch                                                                                                                                                                                                                                             
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 234, in <module>                                       
    _load_global_deps()                                                                                                                                                                                                                                      
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 195, in _load_global_deps        
    _preload_cuda_deps(lib_folder, lib_name)                                                                                                                                                                                                                 
  File "/home/ubuntu/.pex/venvs/8883b0c3afc71f729bee63cebaf2e75004aa475a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.11/site-packages/torch/__init__.py", line 161, in _preload_cuda_deps
    ctypes.CDLL(lib_path)                    
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libnvJitLink.so.12: cannot open shared object file: No such file or directory
make: *** [src/python/dh/infer/Makefile:5: dh-infer-task-generate] Error 1
g
So now the question is if the pex_binary vs python_source distinction is important. Both build a pex, but they execute differently.
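For context on the traceback above: torch's _preload_cuda_deps walks sys.path looking for the nvidia wheels' lib directories and dlopens them before libtorch loads - roughly like this simplified paraphrase (not the real torch source); the OSError is raised from that ctypes.CDLL call when the dynamic linker can't resolve the library or one of its own dependencies:
import ctypes
import glob
import os
import sys

def preload_cuda_dep(lib_folder: str, lib_name: str) -> None:
    # Search each sys.path entry for e.g. nvidia/cufft/lib/libcufft.so.* and dlopen it.
    for entry in sys.path:
        candidates = glob.glob(os.path.join(entry, "nvidia", lib_folder, "lib", lib_name))
        if candidates:
            ctypes.CDLL(candidates[0])
            return
    raise OSError(f"{lib_name} not found on sys.path")

preload_cuda_dep("cufft", "libcufft.so.*")
preload_cuda_dep("nvjitlink", "libnvJitLink.so.*")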
g
Ok
I didn't know there was much of a difference
Okay, I'll try running it directly:
pants run src/python/dh/infer/tasks/main.py
Rather than the pex binary target
g
There shouldn't be (functionally), but they're constructed slightly differently. Others know more about the exact differences, but when running a source directly the dependencies are built separately from the sources, while a pex_binary puts sources and deps in together. This means running a source directly is much more efficient, as source-level changes don't rebuild the whole pex, whereas a pex_binary target will rebuild on any source change IME.
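In BUILD terms the two flavours look something like this (an illustrative sketch - target names are made up for the example):
# `pants run path/to/main.py` goes through the python_sources target, so third-party
# deps are built as their own pex and your sources are layered on top.
python_sources(name="src")

# `pants run path/to:generate` builds this whole pex_binary (sources + deps together),
# so it gets rebuilt whenever a source file changes.
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[":src"],
    execution_mode="venv",
)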
g
NO WAY THAT WORKED
Okay well there's something with the pex binary
fuck it we can run with this omg thank you so much!!!!
g
Nice 🄳 I have a slight idea what the issue is here, but I'll have to dig a bit before filing an issue. It matches why it works in our development: our big pexes with GPU support all build with execution_mode="venv" and get shoved into containers. In all our local development we run file targets (though we primarily use cmd:{train,server} etc. for teachability). So that reflects what we've seen here. I did a cursory glance at the code and it looks like the run-source code path does build a venv as well and links it to the sources, so that checks out.
Thanks for your patience working through this šŸ™‚
g
Thank me? Thank you! It's also less patience and more stubbornness. Please link the issue so I can try and understand. w.r.t. execution_mode=venv, I actually had that set in my pex_binary. I'll admit I don't understand how/why it's needed, but it was a copy/paste job from the docs when putting the other services into containers. We're not yet at the point where I'm able to run these in docker containers; I'm just trying to bite things off one step at a time.
pex_binary(
    name="generate",
    entry_point="main.py",
    dependencies=[":src"],
    resolve="infer",
    args=["generate"],
    layout="packed",
    execution_mode="venv",
    restartable=True,
)
This is what I had been just running directly in dev - and if this was a "non gpu service" I'd just dump that pex into a docker container (as per the docker docs)
But for this use case, we're running directly on the host instance with the GPU, so we figured just running source on there is a good starting point before going to production, just to get all the systems working together.
g
Interesting, so maybe the execution_mode is a red herring. I'll file an issue sometime during the weekend when I have time, and see if I can repro your issue. šŸ™‚ It's a shame the torch 2 pexes take such a long time to build.
Fwiw, packaging the cuda.py as a pex_binary, then extracting and running it, reproduces the issue at least. So likely some path shenanigans going wrong...
g
Oh nice, so you get the same issue when using a pex_binary too? I mean I'd consider that a win. Thank you so much once again!
g
Ok; I'm not sure about the exact implications overall, but loose + venv works on my reproduction, likely because it ships everything. Edit: Nope, after clearing out .pex it breaks again. So nevermind that. But I've found a smoking gun!