adventurous-greece-2061
06/20/2024, 3:22 PM

adventurous-greece-2061
06/20/2024, 3:23 PM
We have two monorepos, core and data-science. Their source trees look basically like this:
core
├── io
├── spatial
├── image_processing
├── chemistry
└── etc.
data-science
├── rasters
├── stats
├── machine_learning
└── etc.
These modules have some dependencies between them: in core, for example, image_processing may depend on io but not on chemistry. Likewise, in data-science, the machine_learning submodule may depend on stats but not the other way around, nor does it depend on e.g. rasters. (This is all geared toward geospatial computations, hence the names.) In principle these dependencies can be untangled into a DAG, although in practice there is currently some coupling between them.
Now, the build process we have takes the core monorepo and generates a wheel from it, which is stored in a private index. The requirements.txt for this wheel is compiled using pip-compile with various .in files as inputs, and since the repo has been around for a while, it has accumulated quite a large number of requirements. For the data-science repo, we then add core==whatever to the .in requirements input, and the same pip-compile step creates the requirements.txt for data-science. Finally, the data-science build produces not a wheel but a Docker image that can be used in e.g. JupyterHub. So basically the goal is to give our data scientists one mega-image with the whole environment that they can use to do their work.
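(For concreteness, the data-science side's input is roughly a requirements.in along these lines; the names and version here are invented:)
# data-science requirements.in (illustrative)
core==1.2.3     # the core wheel published to our private index
torch           # needed only by machine_learning, but everyone gets it
pip-compile then turns this into the fully pinned requirements.txt that the Docker build consumes.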

This works pretty well, but the problem we're running into is two-fold.

First, some of the components of the data-science monorepo are actually quite useful to other downstream projects, and it's hard to break them out independently. In my example above, rasters is a module for raster processing of geospatial data; it has minimal dependency on other code within core and no dependency on code within data-science. It would be ideal to let other projects simply declare a mycompany-rasters==whatever requirement, but right now that isn't possible.

Second, building the monorepo requires running all of its tests and building a fairly huge image. For example, the machine_learning submodule depends on torch, but most of our users do not use `torch`; even so, they must pay the price of waiting for that image to be built and the associated tests to run, even when they're working on something that does not depend on machine_learning in any way.

Furthermore, updating core requires the "PR dance": an accompanying PR has to be made in data-science against the development version of core to demonstrate that it doesn't break anything, then the core PR gets merged to master, then the downstream gets updated to depend on the master version of core. As you can imagine it's quite tedious and time-consuming.

adventurous-greece-2061
06/20/2024, 3:23 PM
What I would like is a layout roughly like this:
src/core/
├── io
├── spatial
├── image_processing
├── chemistry
└── etc.
src/data_science/
├── rasters
├── machine_learning
├── stats
├── remote_sensing
└── etc.
Then, I would like to have internal interdependencies between the projects. For example, rasters depends on core.spatial and core.io but not on anything else. The goal is ultimately to be able to build rasters as an independent library, mycompany-rasters, that other subprojects, such as remote_sensing, could depend on. Of course we also want to generate an artifact that projects external to these two monorepos can depend on.
Sorry for the wall of words, but it's a complicated setup. Ultimately my question is: is Pants the right tool for accomplishing this? I have read the documentation, but I'm still a bit hazy on a) how I would set up the requirements and resolves for this kind of thing, and b) how I would encode the fact that e.g. remote_sensing depends on rasters and should be built into a wheel containing only the dependencies that remote_sensing actually uses, i.e. I do not want to bundle torch into that library because it doesn't need it.
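For what it's worth, my rough guess from the docs is that the resolve setup would start with something like this in pants.toml; treat it as a sketch, since the paths and names are invented:
# pants.toml
[GLOBAL]
backend_packages = ["pants.backend.python"]

[source]
# make core.* and data_science.* importable as top-level packages
root_patterns = ["/src"]

[python]
enable_resolves = true
default_resolve = "reqs"

[python.resolves]
reqs = "3rdparty/python/reqs.lock"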

adventurous-greece-2061
06/20/2024, 3:29 PM
So far I have a BUILD in the root of the project (outside the source tree) that looks like this:
python_requirements(
    name="reqs",
    source="requirements.txt",
    resolve="reqs",
)
and then in the source tree itself (e.g. src/core) a BUILD that is something like:
python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs",
)
python_distribution(
    name="spatial",
    dependencies=[":spatial-src", "//:reqs"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)
While this does build the spatial wheel, that wheel's setup.py includes all of the packages from the resolve, which is not what I want. Also, I don't know how to make rasters depend on spatial, but that's a different question; it would be great to solve this first.
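My guess is that the explicit "//:reqs" dependency is what drags every pinned package into the metadata, and that leaning on dependency inference instead gets closer to what I want. Something like this (an untested sketch):
python_sources(
    name="spatial-src",
    sources=["spatial/**/*.py"],
    resolve="reqs",
)
python_distribution(
    name="spatial",
    # no "//:reqs" here; inference should add only the requirements that
    # spatial's own imports actually use
    dependencies=[":spatial-src"],
    wheel=True,
    sdist=True,
    provides=python_artifact(
        name="spatial",
        version="0.0.1",
        description="spatial",
    ),
)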

curved-television-6568
06/20/2024, 3:38 PM

adventurous-greece-2061
06/20/2024, 3:38 PM

curved-television-6568
06/20/2024, 3:38 PM

adventurous-greece-2061
06/20/2024, 3:39 PM

adventurous-greece-2061
06/20/2024, 3:42 PM
What about data_science/rasters depending on core/spatial? Is it the same thing, where just having the import in the code is enough for Pants to figure it out?
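For example (hypothetical module and function names), I would hope that an import like this is all Pants needs to see, assuming src is configured as a source root:
# src/data_science/rasters/grid.py
from core.spatial import reproject   # inference should map this to the core/spatial sources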

curved-television-6568
06/20/2024, 4:02 PM

adventurous-greece-2061
06/20/2024, 4:39 PM
Ah, so I would end up with a spatial==0.0.1 line in the setup.py of my rasters project. Just so I understand this better: when Pants is working on the repo, it doesn't need the projects it depends on to actually exist as built artifacts, right? It "just knows" that I have spatial==0.0.1 in the tree (because I have specified it as a build target) and then figures out how to build things that depend on it?
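Concretely, I picture the rasters side being not much more than this (a rough sketch with placeholder names and version), with the spatial==0.0.1 requirement ending up in the generated metadata because the imported sources are owned by the spatial python_distribution rather than being bundled into rasters:
# src/data_science/rasters/BUILD
python_sources(
    name="rasters-src",
    sources=["**/*.py"],
    resolve="reqs",
)
python_distribution(
    name="rasters",
    dependencies=[":rasters-src"],
    provides=python_artifact(
        name="mycompany-rasters",
        version="0.0.1",
    ),
)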

adventurous-greece-2061
06/20/2024, 4:42 PM
Should I instead have a pyproject.toml for each of my subprojects, list the dependencies there, and then unify them somehow at the top level? Right now, as I said, they're generated via pip-compile from .in files, but is there a more idiomatic way to do this in Pants?
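From skimming the docs, my impression is that the more idiomatic shape would be one requirements file (or pyproject.toml) per resolve, declared with a python_requirements or poetry_requirements target, and then letting `pants generate-lockfiles --resolve=reqs` produce the pins instead of running pip-compile by hand. Something like this, untested:
# 3rdparty/python/BUILD
python_requirements(
    name="reqs",
    source="requirements.txt",   # loose version specifiers; the lockfile does the pinning
    resolve="reqs",
)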

happy-kitchen-89482
06/22/2024, 11:33 AM