# general
f
Has anyone (successfully) created a pex file capable of running a PyTorch model on gpu. I can get CPU working but i think the CUDA libraries aren't available when running the .pex. I assume they're not included in the environment.
g
Are we talking about bare execution or in a container? I've got both working, though we have explicitly opted to remain on a Torch version that doesn't source CUDA from pip. (Which I think was added in 1.13, as we're on 1.12.)
I have this script I use to debug CUDA; it might be useful for you too. It'll try nvidia-smi, nvcc, plus query a bunch of things from the CUDA context.
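(The script itself wasn't shared in the thread; a minimal sketch along the same lines, checking the external tools and then the torch build if it's importable, might look like this. The function names here are made up for illustration.)

```python
import shutil
import subprocess


def run_tool(name, args=("--version",)):
    """Run a diagnostic binary if it's on PATH; return its output, else None."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, *args], capture_output=True, text=True)
    return result.stdout.strip() or result.stderr.strip()


def cuda_report():
    """Collect a dict of CUDA diagnostics; every entry degrades to None gracefully."""
    report = {
        "nvidia-smi": run_tool("nvidia-smi", ()),   # driver + visible GPUs
        "nvcc": run_tool("nvcc", ("--version",)),   # toolkit compiler, if installed
    }
    try:
        import torch  # may be absent outside the pex environment
        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["built_for_cuda"] = torch.version.cuda          # None on CPU/ROCm builds
        report["built_for_hip"] = getattr(torch.version, "hip", None)  # set on ROCm builds
    except ImportError:
        report["torch"] = None
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

The `built_for_cuda` / `built_for_hip` pair is the quick way to spot which wheel variant actually got resolved.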
p
We gave up on it and shoved raw source files directly into a container based on one of nvidia's containers.
f
Bare execution rather than in a container for now. The machine has CUDA installed and we have containers with CUDA installed. I wasn't anticipating there being any additional issue using a pex on a container vs directly on the machine. My issue seems to be that pants run gives this error:
158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/matt/.pex/installed_wheels/7a7b3b5493d0fd45b9b049942416d44b857def07219a240ec0d55eb65673b7b4/torch-2.1.0+rocm5.6-cp310-cp310-linux_x86_64.whl/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No HIP GPUs are available
Which makes me think it is not able to interact with CUDA, perhaps because the hermetic environment is not allowing access to it? It doesn't sound like you've had any issue with CUDA/GPU for pants run in your cases?
g
The one issue we've had is compute version, which is primarily a HW mismatch. And of course, torch wheels being quite bad. But that error message makes me think it's a ROCm issue, maybe missing some environment setup? I've not done AMD GPU work in the last five years so I have no idea about failure cases there.
f
This is on an NVIDIA GPU machine so I don't think ROCm can be related, although that's the only other example I can find online of this error message, and any googling of "HIP GPUs" results in AMD GPU stuff....
g
You are using the rocm torch variant, so I think that is wrong then. :-)
f
oh yeah, good spot. I'll have to work out why on earth it's chosen that variant... Thanks 👍
g
No problem. I'm going to guess that you have a ROCm index listed in your pants.toml, which would lead to it being picked. Per PEP 440, local versions should be picked with preference, so 1.2+foo is better than 1.2.
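(That local-version ordering can be verified with the packaging library, assuming it's installed; this snippet is not from the original thread:)

```python
from packaging.version import Version

# Per PEP 440, a local version (the "+suffix" part) sorts above the same
# public version, so a resolver preferring the highest version will pick it.
assert Version("1.2+foo") > Version("1.2")
assert Version("2.1.0+rocm5.6") > Version("2.1.0")
```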
f
yeah, unfortunately I don't think they've used +foo on their packages since moving to PyTorch 2.x.x :(
g
The one you have at least uses one: torch-2.1.0+rocm5.6
f
it seems it has them for everything except the version I want, i.e. cu118 exists, cpu exists, but cu121 does not.... At least I can add some ! constraints in to prevent it now. Thanks
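(A sketch of what such an exclusion could look like in a requirements file; the exact pins and which local versions exist on the index are assumptions, not from the thread. Per PEP 440, `==2.1.0` alone would still match `2.1.0+rocm5.6`, since local labels are ignored when matching a public version, hence the explicit `!=` with the local label:)

```
# requirements.txt -- hypothetical: pin torch and exclude the ROCm variant
# so the resolver can't prefer it even if a ROCm index is configured.
torch==2.1.0,!=2.1.0+rocm5.6
```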
g
It looks like they're going towards a flatter layout, and +cu121 exists here: https://download.pytorch.org/whl/torch/
f
ok, so I'm getting a new error; I suspect that means progress. Unfortunately there's no mention of it on the internet and not much to go on:
Failed to digest inputs: "Error storing Digest { hash: Fingerprint<8bcd54aa90ddf3f77d4732fb12163475b6d5c9f61bbf8ae729f30685c2ff97e7>, size_bytes: 4119326090 }: Input/output error"
There's more info above but I'm not sure how helpful it is. It looks like it can't write the file, which seems to be about 4 GB. There is 197 GB of space on the machine, and I can't think what else would cause an input/output error. Resolved this with layout=packed. Thanks for all your help! :D
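(For reference, the packed layout is set on the pex_binary target in a BUILD file; the target name and entry point here are made-up placeholders:)

```python
# BUILD -- hypothetical target; "packed" splits the pex into many smaller
# files instead of one zipapp, which avoids digesting a single huge file.
pex_binary(
    name="train",
    entry_point="train.py",
    layout="packed",
)
```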
g
Hmm, did I ask what Pants version you're on? I think that issue should be resolved in 2.17; before that there was a 2 GB limit on files.
f
i was on 2.16; I bumped up to 2.18 as part of trying to solve this, but I think I did that at the same time as changing the layout to packed. I'll check on Monday whether it's the layout or version change (or either) that fixed it.
g
Ack. I think this:
Failed to digest inputs: "Error storing Digest { hash: Fingerprint<8bcd54aa90ddf3f77d4732fb12163475b6d5c9f61bbf8ae729f30685c2ff97e7>, size_bytes: 4119326090 }: Input/output error"
implies 2.16, so that corroborates your story.
👍 1