# general
a
Hello pants community. I have a situation which I think might be a good use case for pants, but I can’t seem to quite get it to work, probably because I’m confused about how some of the concepts work. Because it's a long explanation of what I'm trying to do, I'm going to put it into a 🧵 so I don't dump a wall of text into the main channel.
πŸ‘ 1
To start, let me explain what our current situation is: At my work we have two Python monorepos, let’s call them `core` and `data-science`. Their source trees look basically like this:
```
core
├── io
├── spatial
├── image_processing
├── chemistry
├── etc.

data-science
├── rasters
├── stats
├── machine_learning
├── etc.
```
These various modules have some dependencies between them, so for example, in `core`, `image_processing` may depend on `io` but not on `chemistry`. Likewise, in `data-science`, the `machine_learning` submodule may depend on `stats` but not the other way around, nor does it depend on e.g. `rasters`. (This is all geared toward geospatial computations, hence the names.) It is possible to untangle these dependencies in such a way as to arrange them in a DAG, although at the moment in practice there's some coupling between them.

Now, the build process we have going takes the `core` monorepo and generates a wheel from it which is stored in a private index. The `requirements.txt` for this wheel is compiled using `pip-compile` with various `.in` files as inputs, and as the repo has been around for a while, it has accumulated quite a large number of requirements. For the `data-science` repo, we then add `core==whatever` to the `.in` requirements input, and then the same `pip-compile` is used to create the `requirements.txt` for `data-science`. Finally, the `data-science` build produces not a wheel but a Docker image which can be mounted in e.g. jupyterhub. So basically the goal is to get our data scientists one mega-image with the whole environment that they can use to do their work.

This works pretty well, but the problem we're running into is two-fold. First, some of the components of the `data-science` monorepo are actually quite useful to other downstream projects, and it's hard to break them out independently. In my example above, `rasters` is a module for raster processing of geospatial data, and has minimal dependency on other code within `core` and no dependency on code within `data-science`. It would be ideal to let other projects simply have a `mycompany-rasters==whatever` requirement, but right now that isn't possible. Second, building the monorepo requires running all of the tests in it and building a fairly huge image. For example, the `machine_learning` submodule depends on `torch`, but most of our users do not use `torch`; even so, they must pay the price of waiting for that image to be built and the associated tests to be run, even if they're working on something that does not depend on `machine_learning` in any way. Furthermore, updating `core` requires the "PR dance": an accompanying PR needs to be made in `data-science` based on the development `core` version to demonstrate that it doesn't break anything, then the `core` PR gets merged to master, then the downstream gets updated to depend on the master version of `core`. As you can imagine, it's quite tedious and time-consuming.
Ok, I hope that sets the scene appropriately. Here's what I'm trying to accomplish and I am wondering if pants is the right tool to do it. First, I would like to have a repo in which we have two namespace projects with a structure that looks like this:
```
src/core/
├── io
├── spatial
├── image_processing
├── chemistry
├── etc.

src/data_science/
├── rasters
├── machine_learning
├── stats
├── remote_sensing
├── etc.
```
Then, I would like to have internal interdependencies between the projects. For example, `rasters` depends on `core.spatial` and `core.io` but not on anything else. The goal is to ultimately be able to build `rasters` as an independent library, `mycompany-rasters`, that other subprojects, such as `remote_sensing`, could depend on. Of course we also want to generate an artifact that projects external to these two monorepos can depend on. Sorry for the wall of words, but it's a complicated setup. Ultimately my question is: is pants the right tool for accomplishing this? I have read the documentation but I'm still a bit hazy on a) how I would set up the requirements and resolves for this kind of thing, and b) how I would encode the fact that e.g. `remote_sensing` depends on `rasters` and should be built into a wheel containing only those dependencies that are actually used in `remote_sensing`, i.e. I do not want to bundle `torch` into that library because it doesn't need it.
So far, what I have tried is having a `BUILD` file in the root of the project (outside the source tree) that looks like this:
```python
python_requirements(
    name="reqs",
    source="requirements.txt",
    resolve="reqs",
)
```
and then in the source tree itself (e.g. `src/core`) a `BUILD` that is something like:
```python
python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs",
)

python_distribution(
    name="spatial",
    dependencies=[":spatial-src", "//:reqs"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)
```
While this does build the `spatial` wheel, that wheel's `setup.py` includes all of the packages from the resolve, which is not what I want. Also, I don't know how to make `rasters` depend on `spatial`, but that's a different question; it would be great to solve this first.
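(For reference, the `resolve="reqs"` in the snippets above presumes that resolves are enabled in `pants.toml`. A minimal sketch, where the resolve name and lockfile path are assumptions:)

```toml
[GLOBAL]
backend_packages = ["pants.backend.python"]

[python]
enable_resolves = true

[python.resolves]
# Each named resolve maps to its lockfile; "reqs" matches the
# resolve referenced by the BUILD targets above.
reqs = "3rdparty/python/reqs.lock"
```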
c
Remove the `//:reqs` from the dist target; that’s why it includes the entire resolve. Pants is smart enough to figure out what to include, so only depend on the top-level first-party code you want to ship.
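(In other words, keeping the names from the snippet above, the corrected target would look something like this:)

```python
python_distribution(
    name="spatial",
    # Only the first-party sources: Pants infers which third-party
    # requirements from the resolve are actually imported and adds
    # just those to the wheel's install_requires.
    dependencies=[":spatial-src"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)
```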
a
Thanks! I will give that a try.
c
Oh, and Hi 👋 and welcome to our community :)
a
👋
Ok, amazing, thanks Andreas! That worked exactly as intended! Now for my next question: how do I get `data_science/rasters` to depend on `core/spatial`? Is it the same thing, where just having the `import` in the code is enough for pants to figure it out?
c
Yea, that should be enough 👍
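(For instance, purely a sketch assuming the `src/data_science` layout and the same `reqs` resolve as the `core` side, the `rasters` targets might look like:)

```python
# src/data_science/BUILD -- hypothetical, mirroring the core side.
python_sources(
    name="rasters-src",
    sources=["rasters/**/*.py"],
    resolve="reqs",
)

python_distribution(
    name="rasters",
    dependencies=[":rasters-src"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="mycompany-rasters",
        version="0.0.1",
        description="raster processing for geospatial data",
    ),
)
```

An `import core.spatial` anywhere under `rasters/` is what lets Pants infer the cross-project dependency.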
a
Indeed, that also worked: I see a `spatial==0.0.1` line in the `setup.py` of my `rasters` project. Just so I understand this better: when pants is working on the repo, it doesn't need the dependent projects to actually exist as built artifacts, right? It "just knows" that I have `spatial==0.0.1` in the tree (because I have specified it as a build target) and then figures out how to build things that depend on it?

And when it comes to updating my requirements, what would be the best way to do dependency resolution? Should I have a separate `pyproject.toml` for each of my subprojects, list the dependencies there, and then unify them somehow at the top level? Right now, as I said, they're already generated via `pip-compile` from `.in` files, but is there a more idiomatic way to do this in pants?
h
Correct - when you run tests or do stuff in the repo Pants operates directly from sources, and you don't need to build wheels (although that can be forced to happen if necessary, e.g., for native code)
πŸ‘ 1