adventurous-greece-2061
06/20/2024, 3:22 PM
adventurous-greece-2061
06/20/2024, 3:23 PM
`core` and `data-science`. Their source trees look basically like this:
core
├── io
├── spatial
├── image_processing
├── chemistry
└── etc.
data-science
├── rasters
├── stats
├── machine_learning
└── etc.
These various modules have some dependencies between them. For example, in `core`, `image_processing` may depend on `io` but not on `chemistry`. Likewise, in `data-science`, the `machine_learning` submodule may depend on `stats` but not the other way around, nor does it depend on e.g. `rasters`. (This is all geared toward geospatial computations, hence the names.) It is possible to untangle these dependencies and arrange them in a DAG, although in practice there's currently some coupling between them.
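As a rough illustration of the "arrange them in a DAG" point, the sketch below checks a module dependency map for cycles with the standard library's `graphlib`. The `DEPS` map and the `build_order` helper are hypothetical; only the edges for `image_processing` and `machine_learning` come from the description above, and the `rasters` edges are the ones stated later in the thread.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: module -> set of first-party modules it depends on.
DEPS = {
    "core.io": set(),
    "core.spatial": set(),
    "core.chemistry": set(),
    "core.image_processing": {"core.io"},
    "data_science.stats": set(),
    "data_science.rasters": {"core.spatial", "core.io"},
    "data_science.machine_learning": {"data_science.stats"},
}


def build_order(deps):
    """Return modules with dependencies first; raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


order = build_order(DEPS)
print(order)
```

If the existing coupling introduces a cycle (say `stats` importing from `machine_learning`), `build_order` raises instead of returning an order, which makes it a cheap way to see what still needs untangling.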
Now, the build process we have going takes the `core` monorepo and generates a wheel from it, which is stored in a private index. The `requirements.txt` for this wheel is compiled using `pip-compile` with various `.in` files as inputs, and as the repo has been around for a while, it has accumulated quite a large number of requirements. For the `data-science` repo, we then add `core==whatever` to the `.in` requirements input, and the same `pip-compile` is used to create the `requirements.txt` for `data-science`. Finally, the `data-science` build produces not a wheel but a Docker image which can be mounted in e.g. JupyterHub. So basically the goal is to give our data scientists one mega-image with the whole environment that they can use to do their work.
This works pretty well, but the problem we're running into is two-fold. First, some of the components of the `data-science` monorepo are actually quite useful to other downstream projects, and it's hard to break them out independently. In my example above, `rasters` is a module for raster processing of geospatial data, and has minimal dependency on other code within `core` and no dependency on code within `data-science`. It would be ideal to let other projects simply have a `mycompany-rasters==whatever` requirement, but right now that isn't possible. Second, building the monorepo requires running all of the tests in it and building a fairly huge image. For example, the `machine_learning` submodule depends on `torch`, but most of our users do not use `torch`; even so, they must wait for that image to be built and the associated tests to be run, even if they're working on something that does not depend on `machine_learning` in any way. Furthermore, updating `core` requires the "PR dance": an accompanying PR needs to be made in `data-science` based on the development `core` version to demonstrate that it doesn't break anything, then the `core` PR gets merged to master, then the downstream gets updated to depend on the master version of `core`. As you can imagine, it's quite tedious and time-consuming.
adventurous-greece-2061
06/20/2024, 3:23 PM
src/core/
├── io
├── spatial
├── image_processing
├── chemistry
└── etc.
src/data_science/
├── rasters
├── machine_learning
├── stats
├── remote_sensing
└── etc.
Then, I would like to have internal interdependencies between the projects. For example, `rasters` depends on `core.spatial` and `core.io` but not on anything else. The goal is ultimately to be able to build `rasters` as an independent library, `mycompany-rasters`, that other subprojects, such as `remote_sensing`, could depend on. Of course, we also want to generate an artifact that projects external to these two monorepos can depend on.
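To make the goal concrete, here is a minimal sketch of the idea that a library's requirements should be only what its code transitively uses. It is not how any particular build tool implements this. Both maps and the `requirements_for` helper are hypothetical, and apart from `torch` (mentioned above for `machine_learning`) the third-party package names are made up for illustration.

```python
# Hypothetical map: module -> first-party modules it imports.
FIRST_PARTY = {
    "core.io": set(),
    "core.spatial": set(),
    "stats": set(),
    "rasters": {"core.spatial", "core.io"},
    "remote_sensing": {"rasters"},
    "machine_learning": {"stats"},
}

# Hypothetical map: module -> third-party packages it imports directly.
THIRD_PARTY = {
    "core.io": {"numpy"},
    "core.spatial": {"numpy", "shapely"},
    "stats": {"scipy"},
    "rasters": {"rasterio"},
    "remote_sensing": set(),
    "machine_learning": {"torch"},
}


def requirements_for(module, first_party, third_party):
    """Collect third-party packages reachable from `module` via first-party imports."""
    seen, stack, reqs = set(), [module], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        reqs |= third_party.get(current, set())
        stack.extend(first_party.get(current, set()))
    return reqs


remote_sensing_reqs = requirements_for("remote_sensing", FIRST_PARTY, THIRD_PARTY)
print(sorted(remote_sensing_reqs))  # ['numpy', 'rasterio', 'shapely']
```

Under these assumed maps, `remote_sensing` never picks up `torch`, because nothing it reaches imports it.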
Sorry for the wall of words, but it's a complicated setup. Ultimately my question is: is Pants the right tool for accomplishing this? I have read the documentation but I'm still a bit hazy on a) how I would set up the requirements and resolvers for this kind of thing, and b) how I would encode the fact that e.g. `remote_sensing` depends on `rasters` and should be built into a wheel containing only those dependencies that are actually used in `remote_sensing`, i.e. I do not want to bundle `torch` into that library because it doesn't need it.
adventurous-greece-2061
06/20/2024, 3:29 PM
`BUILD` in the root of the project (outside the source tree) that looks like this:
python_requirements(
    name="reqs",
    source="requirements.txt",
    resolve="reqs",
)
and then in the source tree itself (e.g. `src/core`) a `BUILD` that is something like:
python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs",
)
python_distribution(
    name="spatial",
    dependencies=[":spatial-src", "//:reqs"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)
While this does build the `spatial` wheel, that wheel's `setup.py` includes all of the packages from the resolver, which is not what I want. Also, I don't know how to make `rasters` depend on `spatial`, but that's a different question; it would be great to solve this first.
curved-television-6568
06/20/2024, 3:38 PM
adventurous-greece-2061
06/20/2024, 3:38 PM
curved-television-6568
06/20/2024, 3:38 PM
adventurous-greece-2061
06/20/2024, 3:39 PM
adventurous-greece-2061
06/20/2024, 3:42 PM
`data_science/rasters` depending on `core/spatial`? Is it the same thing where just having the `import` in the code is enough for Pants to figure it out?
curved-television-6568
06/20/2024, 4:02 PM
adventurous-greece-2061
06/20/2024, 4:39 PM
`spatial==0.0.1` line in the `setup.py` of my `rasters` project. Just so I understand this better: when Pants is working on the repo, it doesn't need the dependent projects to actually exist as built artifacts, right? It "just knows" that I have `spatial==0.0.1` in the tree (because I have specified it as a build target) and then figures out how to build things that depend on it?
adventurous-greece-2061
06/20/2024, 4:42 PM
`pyproject.toml` for each of my subprojects, list the dependencies there, and then unify them somehow at the top level? Right now, as I said, they're already generated via `pip-compile` from `.in` files, but is there a more idiomatic way to do this in Pants?
happy-kitchen-89482
06/22/2024, 11:33 AM