# general
a
I'm sorry to ask, but what's the current state of the art for managing torch across architectures? Is there a decent example somewhere? I have some engineers running Linux with or without a GPU, and some engineers running recent Macs.
g
We've been using the same approach for ~2 years now: three resolves - one for CPU, one for GPU with pinned CUDA, and one generic. We've tried alternatives whenever we got tired of the parametrization and lockfile management, but this is the most stable setup for us.
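A minimal sketch of what that three-resolve layout can look like in pants.toml - the lockfile paths and the exact PyTorch index URLs here are illustrative, not quoted from this setup:
```toml
# pants.toml (sketch): three resolves, with `base` as the default.
[python]
enable_resolves = true
default_resolve = "base"

[python.resolves]
base = "3rdparty/python/base.lock"
cpu = "3rdparty/python/cpu.lock"
gpu = "3rdparty/python/gpu.lock"

[python-repos]
# PyTorch's extra wheel indexes (illustrative). Which torch build each resolve
# actually pins (a +cpu wheel vs. a CUDA wheel) is driven by the requirements
# targets assigned to that resolve, not shown here.
indexes.add = [
  "https://download.pytorch.org/whl/cpu",
  "https://download.pytorch.org/whl/cu121",
]
```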
a
Makes sense - how does that work in practice? When you run, e.g., `pants test` or `pants package`, how are you selecting the correct resolve?
g
Most commands just default to `base` (= `[python].default-resolve`), so if you don't specify a parametrization, that's what you get OOTB. It's also the only one that works on Mac, since the `+cpu` wheels don't exist there and GPU doesn't work (bar ROCm, but we don't support that). Most of our serious work happens in our cloud system either way, so the `@parametrization=gpu` variant is mostly used in specific dev flows by our researchers and when building our containers. Our pre-commit etc. also forces `@cpu`, primarily because it's much quicker when you can bypass all the CUDA library packages, torch kernels, etc. Same with CI, those machines don't have GPUs. We also have flag aliases set up, like `--with-cpu = "--python-default-resolve=cpu"`.
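That alias lives in pants.toml's `[cli.alias]` section; roughly like this (only `--with-cpu` is quoted above, the `--with-gpu` counterpart is a hypothetical addition):
```toml
# pants.toml (sketch): flag aliases, so `pants --with-cpu test ::` expands to
# the full option below.
[cli.alias]
--with-cpu = "--python-default-resolve=cpu"
--with-gpu = "--python-default-resolve=gpu"
```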
The major caveat is that it's sometimes funky to generate lockfiles on a Mac... it's gotten better, I think, but we've had issues with torch only declaring its platform dependencies in its platform-specific wheels -- so a Mac user doesn't see a conditional declaration for CUDA at all, and their lockfiles end up broken... not sure if that's fixed; it's been a while since I had to fix it in our repo.
a
Hmmmm. I've just gone through a process of separating resolves, so I now have one for inference, one for data pipelines, one for edge, etc. Seems like what I'd need is multiple inference resolves and then some DX-friendly way to select one.
g
Oof; yeah. I've generally found things work better the fewer resolves we have, and this is the minimum I can get away with for any actual code we develop. We also make this work with aggressive use of parametrization + defaults... that has some spectacularly sharp corners when you use named parametrizations. But this is pretty much the root declaration that makes it all work - it applies to our whole repository, except where we override it.
```python
__defaults__(
    {
        pex_binary: dict(execution_mode="venv", venv_site_packages_copies=True),
        (python_source, python_sources): dict(
            **parametrize("cpu", resolve="cpu", skip_pyright=True),
            **parametrize("gpu", resolve="gpu", skip_pyright=True),
            **parametrize("base", resolve="base"),
        ),
        python_distribution: dict(skip_twine=True),
    }
)
```
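And a hypothetical leaf BUILD file showing the kind of override mentioned above - setting `resolve` explicitly on a target replaces the parametrized default (target names and entry point here are made up):
```python
# Hypothetical BUILD file for a GPU-only training directory: pinning `resolve`
# explicitly gives these targets a single resolve instead of the repo-wide
# cpu/gpu/base parametrization from __defaults__.
python_sources(
    name="lib",
    resolve="gpu",
    skip_pyright=True,
)

pex_binary(
    name="train",
    entry_point="train.py",
    resolve="gpu",
)
```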
a
Yeah, I got to a point where I could no longer generate lockfiles or update things to pick up security patches, and where some weird ML dep was holding back our ability to adopt different tooling elsewhere. Hence the other thread :D
g
Yeah -- I work on a purely RL team and we build almost entirely on torch, so we can avoid that. I know some of the data and inference tools are gnarly: complex to the point of having circular dependencies on each other.