g
I'm trying to migrate an existing monorepo to use pants and I'm a bit lost when trying to tie first-party dependencies between the different packages within the repo - while also dealing with requirements.txt files that declare the dependencies. More details in 🧵
Let's say we have the following directory structure, and `package-2` depends on `package-1`:
Copy code
monorepo-example/
├── 3rdparty
│   └── python
│       └── default.lock
├── README.md
├── packages
│   ├── package-1
│   │   ├── BUILD
│   │   ├── package_1
│   │   │   └── __init__.py
│   │   ├── requirements.txt
│   │   └── setup.py
│   └── package-2
│       ├── BUILD
│       ├── package_2
│       │   └── __init__.py
│       ├── requirements.txt
│       └── setup.py
├── pants
└── pants.toml
Before adopting pants we've used `pip-compile` to assemble our `requirements.txt` files, so `package-2/requirements.txt` might look like this and pull from our private PyPI repository:
Copy code
pandas==1.5.0
boto3==1.22.0
package-1>=1,<2
It's worth noting that all of our packages are being migrated from individual repos and we'd like to keep that directory structure the same, with a single `BUILD` file per project and using our own `setup.py` files. Here's what `package-2/BUILD` looks like right now:
Copy code
# Creates a python_requirement target for each entry in this directory's requirements.txt.
python_requirements(
    name="reqs",
)

# First-party sources, with an explicit dependency on all of the requirements above.
python_sources(
    name="lib",
    dependencies=[
        ":reqs",
    ],
    sources=[
        "package_2/**/*.py",
        "package_2/*.py",
    ],
)

# Builds the dist using our handwritten setup.py (generate_setup=False).
python_distribution(
    name="dist",
    dependencies=[
        ":lib",
    ],
    provides=setup_py(
        name="package-2",
    ),
    generate_setup=False,
)
Will pants determine that `package-2` depends on `package-1` based on directory structure, or do I need to add a `packages/package-1:lib` dependency to `packages/package-2:lib`? How can I keep `package-1` out of my third party dependency lock files and have those imports resolve to my other package instead?
h
Quick preliminary question: what are your source roots? I.e., when `package-2` imports from `package-1`, what does the import statement look like?
g
Copy code
[source]
root_patterns = [
  "/packages/package-1/",
  "/packages/package-2/",
]
Copy code
from package_1 import foo
h
Makes sense
And that looks correct
So, there are a couple of ways to go about this:
Oh, sorry, one more question first: Do you need to publish wheels from these projects for consumption outside the monorepo?
Or in other words, what are the deployables that you create from this repo?
g
Yes, we do need to consume some wheels and Docker images from the monorepo elsewhere. Not all of the services consuming the wheels have moved into the repo yet - but in an ideal world every deployable and item consuming the deployables would live inside the repo.
h
So let's assume we're in the ideal world (just to get a sense of what we might be moving towards) - what are the final deployables of the repo? E.g., docker images? binaries? lambdas?
g
It would be docker images. `packages` is turning out to be a poor name for the top-level directory, but maybe a better directory structure looks like this, where the final deployable is the `app-1` docker image (and `app-1` depends on `package-1` and `package-2`):
Copy code
monorepo-example/
├── 3rdparty
│   └── python
│       └── default.lock
├── README.md
├── packages
│   ├── app-1
│   │   ├── BUILD
│   │   ├── Dockerfile
│   │   ├── app_1
│   │   │   └── __init__.py
│   │   ├── requirements.txt
│   │   └── setup.py
│   ├── package-1
│   │   ├── BUILD
│   │   ├── package_1
│   │   │   └── __init__.py
│   │   ├── requirements.txt
│   │   └── setup.py
│   └── package-2
│       ├── BUILD
│       ├── package_2
│       │   └── __init__.py
│       ├── requirements.txt
│       └── setup.py
├── pants
└── pants.toml
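I'd guess `app-1`'s BUILD ends up being roughly a `pex_binary` plus a `docker_image` target, something like this (just my guess - the target names and entry point below are made up):
Copy code
# packages/app-1/BUILD (sketch; assumes pants.backend.python and
# pants.backend.docker are enabled in pants.toml)
python_sources(
    name="lib",
    sources=[
        "app_1/**/*.py",
        "app_1/*.py",
    ],
)

pex_binary(
    name="binary",
    entry_point="app_1.main",  # made-up entry point
)

docker_image(
    name="docker",
    dependencies=[":binary"],  # the Dockerfile COPYs the built PEX into the image
)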
h
Final question: today, how does package-1 consume package-2? Do you publish package-2 and list it in package-1's requirements.txt?
For background: having lots of separate projects, each with its own requirements.txt and setup.py, that consume each other via published dists, is usually a legacy of the fact that standard Python tooling implicitly assumes repos that have only one top-level deployable, which is a dist. That ecosystem just didn't evolve with monorepos in mind. So, if I understand correctly, what you have at the moment is a “monorepo lite”, if you will - separate projects that are each, effectively, their own codebase, even if they happen to be colocated in a single git repo. They consume each other as if they were third-party deps, which involves laborious publishing and versioning, and introduces dependency hell in your first-party code.
Pants supports a more frictionless approach, where you don't need hard project boundaries with separate setup.py and requirements.txt files at all. Instead, package-1 consumes package-2 as a direct in-repo import. Pants will do the right thing when running tests and packaging deployables, based on those imports.
In this idiom you only need a setup.py when you actually need to build a dist for consumption outside the repo.
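To make that concrete, in that idiom a package's BUILD can shrink to something like this (a rough sketch, reusing your existing source globs):
Copy code
# packages/package-2/BUILD (sketch)
python_sources(
    name="lib",
    sources=[
        "package_2/**/*.py",
        "package_2/*.py",
    ],
    # No explicit `dependencies=` needed: Pants infers the first-party dep on
    # package-1 from `from package_1 import ...` statements, and infers the
    # third-party deps (pandas, boto3, ...) from their imports.
)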
And as for requirements.txt, you can have a single one(*) for the entire repo, which enumerates the “universe” of allowed direct 3rdparty requirements for code in the repo, and pants takes the appropriate subset of those requirements as needed, again based on inferred and explicit dependencies.
This (along with a generated lockfile for that requirements.txt) prevents 3rdparty requirement conflicts in different parts of the repo, as you have one source of truth for the entire repo
(*) you can have more than one “universe” if your repo truly requires conflicting deps in different parts, via the named resolves feature, but it's always better if you don't need that.
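Concretely, that usually looks something like this (a sketch - I'm guessing at the layout based on your existing 3rdparty/python/default.lock):
Copy code
# 3rdparty/python/BUILD (sketch)
# One python_requirement target per line of the repo-wide requirements.txt.
# That requirements.txt lists only true third-party deps (pandas, boto3, ...);
# first-party packages like package-1 are not listed in it.
python_requirements(
    name="reqs",
    source="requirements.txt",
)

# pants.toml (sketch)
[python]
enable_resolves = true
default_resolve = "python-default"

[python.resolves]
# Regenerate with `./pants generate-lockfiles` after editing requirements.txt.
python-default = "3rdparty/python/default.lock"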
You can, alternatively, continue to work as you do today. In this case, Pants can generate `setup.py` data for you, including the requirements. So, e.g., Pants sees that package-1 depends on package-2 (because of an import), and knows that package-2 is published as `package-2==1.2.3`, and so adds that to package-1's `setup.py` as a requirement, along with the third-party requirements package-1 imports from.
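A rough sketch of that approach, with placeholder versions (not your actual ones):
Copy code
# packages/package-1/BUILD (sketch; versions are placeholders)
python_distribution(
    name="dist",
    dependencies=[":lib"],
    provides=setup_py(
        name="package-1",
        version="2.0.0",  # placeholder
    ),
    # With generate_setup=True, Pants writes the setup.py, including
    # install_requires: because package-1's code imports package-2, and
    # package-2 is owned by its own python_distribution, Pants adds e.g.
    # `package-2==1.2.3` as a requirement, alongside the inferred
    # third-party requirements.
    generate_setup=True,
)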
It's on my TODO list to write a post about all this
g
That's right, today `package-1` consumes `package-2` by including its dist in the `requirements.txt`. We have some challenges with publishing and versioning - we currently handle this via semantic-release generating release tags like `package-2 v1.2.3`. But for us, collocating the packages into the same repo helps us find breaking changes and update inter-dependencies more easily. This all makes sense. I guess I was thinking of some kind of happy middle ground between the two approaches. For example: `package-1` could consume `package-2` via its dist but still know that when code changes in `packages/package-2/package_2` it needs to re-run unit tests for both. In the second scenario you outlined, where pants decides on dependencies and generates its own setup.py - will pants always use the latest version of the dependent distribution? I'm thinking of a scenario where I want to make a breaking change in `package-2` and slowly roll out that release to the affected dependencies also managed by pants (we're slowly doing this with things like Python major version and PyTorch upgrades). Maybe that goes against the principle of the managed monorepo though and I'm just thinking about it wrong 🤷