Hello pants community I have a situation which I think might Pants #general

adventurous-greece-2061

06/20/2024, 3:22 PM

Hello pants community. I have a situation which I think might be a good use case for pants, but I can’t seem to quite get it to work, probably because I’m confused about how some of the concepts work. Because it's a long explanation of what I'm trying to do, I'm going to put it into a 🧵 so I don't dump a wall of text into the main channel.

👍 1

adventurous-greece-2061

06/20/2024, 3:23 PM

To start, let me explain what our current situation is: At my work we have two Python monorepos, let’s call them `core` and `data-science`. Their source trees look basically like this:

core
├── io
├── spatial
├── image_processing
├── chemistry
├── etc.

data-science
├── rasters
├── stats
├── machine_learning
├── etc.

These various modules have some dependencies between them. For example, in `core`, `image_processing` may depend on `io` but not on `chemistry`. Likewise, in `data-science`, the `machine_learning` submodule may depend on `stats` but not the other way around, nor does it depend on e.g. `rasters`. (This is all geared toward geospatial computations, hence the names.) It is possible to untangle these dependencies and arrange them in a DAG, although at the moment there's some coupling between them in practice.

Now, the build process we have going takes the `core` monorepo and generates a wheel from it, which is stored in a private index. The `requirements.txt` for this wheel is compiled using `pip-compile` with various `.in` files as inputs, and as the repo has been around for a while, it has accumulated quite a large number of requirements. For the `data-science` repo, we then add `core==whatever` to the `.in` requirements input, and then the same `pip-compile` is used to create the `requirements.txt` for `data-science`. Finally, the `data-science` build produces not a wheel but a Docker image which can be mounted in e.g. jupyterhub. So basically the goal is to get our data scientists one mega-image with the whole environment that they can use to do their work.

This works pretty well, but the problem we're running into is two-fold. First, some of the components of the `data-science` monorepo are actually quite useful to other downstream projects, and it's hard to break them out independently. In my example above, `rasters` is a module for raster processing of geospatial data, and has minimal dependency on other code within `core` and no dependency on code within `data-science`. It would be ideal to let other projects simply have a `mycompany-rasters==whatever` requirement, but right now that isn't possible. Second, building the monorepo requires running all of the tests in it and building a fairly huge image. For example, the `machine_learning` submodule depends on `torch`, but most of our users do not use `torch`; even so, they must pay the price of waiting for that image to be built and the associated tests to be run, even if they're working on something that does not depend on `machine_learning` in any way.

Furthermore, updating `core` requires the "PR dance": an accompanying PR needs to be made in `data-science`, based on the development version of `core`, to demonstrate that it doesn't break anything, then the `core` PR gets merged to master, then the downstream gets updated to depend on the master version of `core`. As you can imagine, it's quite tedious and time-consuming.

adventurous-greece-2061

06/20/2024, 3:23 PM

Ok, I hope that sets the scene appropriately. Here's what I'm trying to accomplish and I am wondering if pants is the right tool to do it. First, I would like to have a repo in which we have two namespace projects with a structure that looks like this:

src/core/
├── io
├── spatial
├── image_processing
├── chemistry
├── etc.

src/data_science/
├── rasters
├── machine_learning
├── stats
├── remote_sensing
├── etc.

Then, I would like to have internal interdependencies between the projects. For example, `rasters` depends on `core.spatial` and `core.io` but not on anything else. The goal is to ultimately be able to build `rasters` as an independent library, `mycompany-rasters`, that other subprojects, such as `remote_sensing`, could depend on. Of course we also want to generate an artifact that projects external to these two monorepos can also depend on.

Sorry for the wall of words, but it's a complicated setup. Ultimately my question is: is pants the right tool for accomplishing this? I have read the documentation but I'm still a bit hazy on a) how I would set up the requirements and resolvers for this kind of thing, and b) how I would encode the fact that e.g. `remote_sensing` depends on `rasters` and should be built into a wheel containing only those dependencies that are actually used in `remote_sensing`, i.e. I do not want to bundle `torch` into that library because it doesn't need it.

adventurous-greece-2061

06/20/2024, 3:29 PM

So far, what I have tried is having a `BUILD` in the root of the project (outside the source tree) that looks like this:

python_requirements(
    name="reqs",
    source="requirements.txt",
    resolve="reqs"
)

and then in the source tree itself (e.g. `src/core`) a `BUILD` that is something like:

python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs"
)

python_distribution(
    name="spatial",
    dependencies=[":spatial-src", "//:reqs"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)

While this does build the `spatial` wheel, that wheel's `setup.py` includes all of the packages from the resolver, which is not what I want. Also, I don't know how to make `rasters` depend on `spatial`, but that's a different question; it would be great to solve this first.

curved-television-6568

06/20/2024, 3:38 PM

Remove the `//:reqs` from the dist target. That’s why it includes the entire resolve. Pants is smart enough to figure out what to include, so only depend on the top-level first-party code you want to ship.
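
For reference, a minimal sketch of what the `src/core/BUILD` from earlier in the thread might look like after this change, assuming everything else stays as posted. The only difference is that `//:reqs` is dropped from the distribution's `dependencies`; the comments are a reading of the advice above, not a statement of how every Pants version behaves.

```python
# src/core/BUILD (sketch; same targets as posted above, minus "//:reqs")

python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs",
)

python_distribution(
    name="spatial",
    # Depend only on the first-party code to ship. Listing "//:reqs" here
    # pulled every requirement in the resolve into the generated setup.py.
    dependencies=[":spatial-src"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)
```

With this layout, the third-party requirements that end up in the wheel's metadata should be the ones the `spatial` sources actually import.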

adventurous-greece-2061

06/20/2024, 3:38 PM

Thanks! I will give that a try.

curved-television-6568

06/20/2024, 3:38 PM

Oh, and hi 👋 and welcome to our community :)

adventurous-greece-2061

06/20/2024, 3:39 PM

👋

adventurous-greece-2061

06/20/2024, 3:42 PM

Ok, amazing, thanks Andreas! That worked exactly as intended! Now for my next question: how do I get `data_science/rasters` depending on `core/spatial`? Is it the same thing, where just having the `import` in the code is enough for pants to figure it out?

curved-television-6568

06/20/2024, 4:02 PM

Yea, that should be enough 👍
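
To make the idea concrete, here is a toy illustration of import-based dependency inference in plain Python. It is not how Pants implements it; it only shows the principle that import statements can be mapped to whichever target owns the imported module. The `OWNERS` table, the distribution names, and the imported function names (`reproject`, `open_raster`) are all hypothetical.

```python
import ast

# Hypothetical mapping from first-party module prefixes to the
# distribution that owns them (names follow the thread's example).
OWNERS = {
    "core.spatial": "spatial",
    "core.io": "io",
    "data_science.rasters": "rasters",
}


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z" imports; relative imports are skipped.
            modules.add(node.module)
    return modules


def inferred_owners(source: str) -> set[str]:
    """Map each import to the owning distribution, if any prefix matches."""
    owners = set()
    for module in imported_modules(source):
        for prefix, owner in OWNERS.items():
            if module == prefix or module.startswith(prefix + "."):
                owners.add(owner)
    return owners


# Stand-in for a file under data_science/rasters.
rasters_source = """
import numpy as np
from core.spatial import reproject
from core.io.readers import open_raster
"""

print(sorted(inferred_owners(rasters_source)))  # ['io', 'spatial']
```

Imports with no first-party owner (`numpy` here) are simply not matched by this sketch; in the real tool those are resolved against the third-party requirements instead.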

adventurous-greece-2061

06/20/2024, 4:39 PM

Indeed, that also worked: I see a `spatial==0.0.1` line in the `setup.py` of my `rasters` project. Just so I understand this better: when pants is working on the repo, it doesn't need the dependent projects to actually exist as built artifacts, right? It "just knows" that I have `spatial==0.0.1` in the tree (because I have specified it as a build target) and then figures out how to build things that depend on it?
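
A sketch of how the `rasters` distribution could be declared to get this result, mirroring the `spatial` targets shown earlier. The file location `src/data_science/BUILD`, the target names, and the version are assumptions for illustration; only the artifact name `mycompany-rasters` comes from the goal stated above.

```python
# src/data_science/BUILD (hypothetical; mirrors the spatial targets above)

python_sources(
    name="rasters-src",
    sources=["rasters/**/*.py"],
    resolve="reqs",
)

python_distribution(
    name="rasters",
    # No explicit dependency on the spatial distribution: as reported in the
    # thread, importing core.spatial in the rasters code was enough for the
    # generated setup.py to list spatial==0.0.1.
    dependencies=[":rasters-src"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="mycompany-rasters",
        version="0.0.1",  # placeholder version
    ),
)
```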

adventurous-greece-2061

06/20/2024, 4:42 PM

And when it comes to updating my requirements, what would be the best way to do dependency resolution? Should I have a separate `pyproject.toml` for each of my subprojects, list the dependencies there, and then unify them somehow at the top level? Right now, as I said, they're already generated via `pip-compile` from `.in` files, but is there a more idiomatic way to do this in pants?

happy-kitchen-89482

06/22/2024, 11:33 AM

Correct - when you run tests or do stuff in the repo, Pants operates directly from sources, and you don't need to build wheels (although that can be forced to happen if necessary, e.g., for native code).

👍 1
