bland-father-19717
09/15/2023, 12:07 PM

enough-analyst-54434
09/15/2023, 12:30 PM
platforms
or complete_platforms
on your pex_binary
target. I'll add more detail to the issue in a few hours about how to do this, but you might read here: https://www.pantsbuild.org/docs/reference-pex_binary#codecomplete_platformscode
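For concreteness, a minimal sketch of what that could look like in a BUILD file. The target names and the JSON file name are hypothetical; a complete_platforms file can be generated on the target machine with `pex3 interpreter inspect --markers --tags`:

file(
    name="linux_py38_platform",
    source="linux_py38_platform.json",  # hypothetical name; generated via pex3 interpreter inspect
)

pex_binary(
    name="main",
    entry_point="main.py",
    complete_platforms=[":linux_py38_platform"],
)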
bland-father-19717
09/15/2023, 12:33 PM
platforms
or complete_platforms
as you suggested. I’ll also go through the documentation link you provided. Looking forward to the additional details on the issue.

curved-television-6568
09/15/2023, 1:38 PM

bland-father-19717
09/15/2023, 1:43 PM

curved-television-6568
09/15/2023, 1:51 PM

enough-analyst-54434
09/15/2023, 1:54 PM

curved-television-6568
09/15/2023, 1:55 PM

enough-analyst-54434
09/15/2023, 1:57 PM

bland-father-19717
09/15/2023, 2:01 PM

enough-analyst-54434
09/15/2023, 2:41 PM
(example.venv) jsirois@Gill-Windows:~/support/pants/peachanG $ unzip -qc ~/downloads/torch-2.0.1-cp38-none-macosx_11_0_arm64.whl torch-2.0.1.dist-info/METADATA | grep Requires
Requires-Python: >=3.8.0
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
(example.venv) jsirois@Gill-Windows:~/support/pants/peachanG $ unzip -qc ~/downloads/torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl torch-2.0.1.dist-info/METADATA | grep Requires
Requires-Python: >=3.8.0
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
That differing per-wheel metadata utterly defeats Pex locking. Pex assumes all artifacts for a given project version will contain the same requirement metadata; in other words, it does not download all the available artifacts for a given version (many terabytes' worth in the torch case) to get the requirement metadata. It just downloads one. It looks like your lock downloaded the Mac wheel and thus leaves out all the nvidia requirements. As such, you can never build a proper PEX for Linux using your lock file.

enough-analyst-54434
09/15/2023, 2:43 PM

enough-analyst-54434
09/15/2023, 2:44 PM

bland-father-19717
09/15/2023, 2:53 PM

enough-analyst-54434
09/15/2023, 2:55 PM

enough-analyst-54434
09/15/2023, 2:56 PM

enough-analyst-54434
09/15/2023, 2:57 PM

enough-analyst-54434
09/15/2023, 3:09 PM
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
into manual dependencies in Pants. Something like:
python_requirement(
    name="evil-torch-workaround",
    requirements=[
        "torch==2.0.1",
        'nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"',
        ...
    ],
)
Of course, that means to bump torch, you need to go research what the full union of its requirements is, using unzip like I did above.
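If it helps, that research can be scripted. Here is a rough Python equivalent of the unzip | grep invocations above, using only the standard library; the wheel paths are the ones from this thread and assume the wheels have already been downloaded:

import zipfile

wheels = [
    "torch-2.0.1-cp38-none-macosx_11_0_arm64.whl",
    "torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl",
]
for wheel in wheels:
    print(f"== {wheel}")
    # every wheel carries its metadata at <dist>-<version>.dist-info/METADATA
    metadata = zipfile.ZipFile(wheel).read("torch-2.0.1.dist-info/METADATA")
    for line in metadata.decode().splitlines():
        if line.startswith("Requires-"):
            print(line)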
bland-father-19717
09/15/2023, 3:16 PM

bland-father-19717
09/18/2023, 2:21 PM
/usr/local/bin/python3.8: can't find '__main__' module in '/bin/whisper'
Details: https://github.com/pantsbuild/pants/issues/19505#issuecomment-1723514425

bland-father-19717
09/18/2023, 2:23 PM
pants run src/python/main/whisper/main.py
❌: pants run src/python/main/whisper:main
❌: pants run src/python/main/whisper/Dockerfile
❌: pants run src/python/main/whisper:whisper_docker
enough-analyst-54434
09/18/2023, 6:41 PM
__main__.py
inside even though it is there (I had to install zip and unzip inside the image but I omitted these steps below):
$ docker run --rm -it --entrypoint bash whisper_docker:latest
root@f8fa69b3a6f2:/# ls -lrth /bin/whisper
-r-xr-xr-x 1 root root 2.3G Sep 18 17:25 /bin/whisper
root@f8fa69b3a6f2:/# zipinfo /bin/whisper | tail -1
19515 files, 4576079657 bytes uncompressed, 2368282036 bytes compressed: 48.2%
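Note the uncompressed total above is past 4GiB, the point where a zip needs zip64 extensions, and CPython 3.8's zipimport does not support zip64; that is likely why Python cannot see the __main__.py. A small stdlib sketch of the same inspection, handy for spotting the entries that dominate (the path is this thread's example):

import zipfile

with zipfile.ZipFile("/bin/whisper") as zf:
    infos = zf.infolist()
print(f"{len(infos)} files, {sum(i.file_size for i in infos):,} bytes uncompressed")
# the biggest entries are the natural candidates for the experiment below
for info in sorted(infos, key=lambda i: i.file_size, reverse=True)[:5]:
    print(f"{info.file_size:>13,} {info.filename}")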
As an experiment I removed 1 ~600MB file from the zip to bring it under 4GB uncompressed:
root@f8fa69b3a6f2:/# cp /bin/whisper /bin/whisper.zip
root@f8fa69b3a6f2:/# zip -d /bin/whisper.zip .deps/torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl/torch/lib/libtorch_cuda.so
deleting: .deps/torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl/torch/lib/libtorch_cuda.so
zip warning: Local Version Needed To Extract does not match CD: .deps/torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl/torch/lib/libtorch_cuda_linalg.so
...
root@f8fa69b3a6f2:/# ls -lrth /bin | grep whisper
-r-xr-xr-x 1 root root 2.3G Sep 18 17:25 whisper
-r-xr-xr-x 1 root root 1.9G Sep 18 17:41 whisper.zip
root@f8fa69b3a6f2:/# zipinfo /bin/whisper.zip | tail -1
19514 files, 3919218968 bytes uncompressed, 1970217874 bytes compressed: 49.7%
That then ~works:
root@f8fa69b3a6f2:/# whisper.zip
Traceback (most recent call last):
File "/root/.pex/venvs/75e1762e0292c94b683f75ebf8977148b3c6943e/5fd7049af63e03f347278c89401424cd9731df9a/pex", line 274, in <module>
runpy.run_module(module_name, run_name="__main__", alter_sys=True)
File "/usr/local/lib/python3.8/runpy.py", line 207, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/usr/local/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/.pex/venvs/75e1762e0292c94b683f75ebf8977148b3c6943e/5fd7049af63e03f347278c89401424cd9731df9a/lib/python3.8/site-packages/main/whisper/main.py", line 1, in <module>
import whisper
File "/root/.pex/venvs/75e1762e0292c94b683f75ebf8977148b3c6943e/5fd7049af63e03f347278c89401424cd9731df9a/lib/python3.8/site-packages/whisper/__init__.py", line 8, in <module>
import torch
File "/root/.pex/venvs/75e1762e0292c94b683f75ebf8977148b3c6943e/5fd7049af63e03f347278c89401424cd9731df9a/lib/python3.8/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
ImportError: libtorch_cuda.so: cannot open shared object file: No such file or directory
So ... your easiest option is to use the pex_binary
support for Pex's packed layout, which packages the PEX in a special directory-based format instead of in a zip file. You get this via `layout="packed"`: https://www.pantsbuild.org/docs/reference-pex_binary#codelayoutcode
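In BUILD terms that is one extra field; a sketch, with the target name and entry point assumed:

pex_binary(
    name="main",
    entry_point="main.py",
    layout="packed",  # a directory-based layout, so no single >4GiB zip to import from
)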
When I add that layout="packed"
to your example repo pex_binary
target and update the Dockerfile entrypoint to:
ENTRYPOINT ["/usr/local/bin/python3.8", "/bin/whisper"]
I get:
$ time docker run --rm -it whisper_docker:latest
/root/.pex/installed_wheels/0d1004abc525c92a0e0befc850db2ffe4b4f80e9eb8875b1459d5a3a270880be/openai_whisper-20230314-py3-none-any.whl/whisper/timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See <https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit> for details.
def backtrace(trace: np.ndarray):
100%|███████████████████████████████████████| 139M/139M [00:13<00:00, 10.6MiB/s]
Whisper(
(encoder): AudioEncoder(
(conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
(blocks): ModuleList(
(0-5): 6 x ResidualAttentionBlock(
(attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=512, out_features=2048, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=2048, out_features=512, bias=True)
)
(mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(ln_post): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(decoder): TextDecoder(
(token_embedding): Embedding(51865, 512)
(blocks): ModuleList(
(0-5): 6 x ResidualAttentionBlock(
(attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=512, out_features=2048, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=2048, out_features=512, bias=True)
)
(mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
real 0m41.540s
user 0m0.010s
sys 0m0.020s
So I think that solves all the issues. The startup time is horrendous, though, and it need not be (at the expense of some extra docker image build time). To move the startup overhead into docker image build overhead, also add execution_mode="venv"
to your pex_binary
target and change the Dockerfile to be like so:
FROM python:3.8.17-slim-bullseye
COPY src.python.main.whisper/main.pex /tmp/main.pex
RUN \
    PEX_TOOLS=1 /usr/local/bin/python3.8 /tmp/main.pex venv \
        --remove all \
        --compile \
        --bin-path prepend \
        /bin/whisper
ENTRYPOINT ["/bin/whisper/pex"]
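For reference, the pex_binary target would then carry both fields; again a sketch with assumed names:

pex_binary(
    name="main",
    entry_point="main.py",
    layout="packed",
    execution_mode="venv",  # paired with the Dockerfile's PEX_TOOLS venv step above
)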
Then you get:
$ time docker run --rm -it whisper_docker:latest
/bin/whisper/lib/python3.8/site-packages/whisper/timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See <https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit> for details.
def backtrace(trace: np.ndarray):
100%|███████████████████████████████████████| 139M/139M [00:10<00:00, 14.3MiB/s]
Whisper(
(encoder): AudioEncoder(
(conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
(blocks): ModuleList(
(0-5): 6 x ResidualAttentionBlock(
(attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=512, out_features=2048, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=2048, out_features=512, bias=True)
)
(mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(ln_post): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(decoder): TextDecoder(
(token_embedding): Embedding(51865, 512)
(blocks): ModuleList(
(0-5): 6 x ResidualAttentionBlock(
(attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(cross_attn): MultiHeadAttention(
(query): Linear(in_features=512, out_features=512, bias=True)
(key): Linear(in_features=512, out_features=512, bias=False)
(value): Linear(in_features=512, out_features=512, bias=True)
(out): Linear(in_features=512, out_features=512, bias=True)
)
(cross_attn_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=512, out_features=2048, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=2048, out_features=512, bias=True)
)
(mlp_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(ln): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
real 0m12.823s
user 0m0.000s
sys 0m0.028s
Still pretty bad, but better.

enough-analyst-54434
09/18/2023, 6:48 PM

enough-analyst-54434
09/18/2023, 6:56 PM

enough-analyst-54434
09/18/2023, 6:58 PM

enough-analyst-54434
09/18/2023, 11:31 PM

curved-television-6568
09/18/2023, 11:45 PM
coddle-mode: Option[bool]=False
enough-analyst-54434
09/18/2023, 11:54 PM

curved-television-6568
09/19/2023, 1:11 AM

bland-father-19717
09/19/2023, 1:11 AM

enough-analyst-54434
09/19/2023, 2:10 AM

enough-analyst-54434
09/30/2023, 2:00 AM
--check error
or expose the toggle if it wishes): https://github.com/pantsbuild/pex/pull/2253

bland-father-19717
09/30/2023, 2:07 AM