# general
l
Hi All, 👋 Firstly, thank you for developing a Python-focused build system, and thanks in advance for any comments and thoughts you might have on my questions. 🎉 🙏

I have recently been exploring Pants for a Python monorepo intended for data engineering purposes. The repository is meant to contain ETL Spark applications along with their respective Airflow DAGs for multiple data assets.

Motivations for choosing a monorepo:
• Maintain code owned by a team under a single repository
• Use consistent tooling across the lifecycle of all the code owned by a team
• Abstract away common code as a shared library for all other projects in the monorepo
• Have the flexibility of pulling a project out of the monorepo into another repository in case the ownership of a data asset changes (this is quite common)

Requirements we have for ourselves:
• Each project within the monorepo is to be published to an internal PyPI repository, along with a packaged tar/zip to Amazon S3 (this gets used within our Airflow instance)
• Each project within the monorepo is independent of the others. The only dependency we have is on the common shared library (at least so far)
• Each project needs to follow semantic versioning (so far we have been using it along with conventional commits)
• We want to keep using Python virtual environments and editable installs for the development of a project
• We want to use Poetry
• We want to keep using VS Code as our primary IDE
• We are hosting our monorepo in a GitHub repository and leverage GitHub Actions

Open questions we have: 🤔
• Firstly, what is the recommended development granularity for such a monorepo? Do we always have the entire monorepo open in our IDE, or do we open only the relevant project we need to develop? This has implications for where to declare a project's dependencies (see the next point), and for using Pants to export virtual environments vs. using Poetry to set them up.
• Currently the dependencies of our projects overlap heavily with each other. What tradeoffs do we need to evaluate for per-project dependency management vs. global dependency management at the monorepo level?
• We also need help in setting up the "commons" library such that it is available as a dependency for all our projects and is part of each project's virtual environment. It should also be included in the bundled package for each project. How can this be achieved?
• We currently expect that we will have to bump project versions manually. Is there a built-in feature to auto-update versions? And what is the recommendation for semantic versioning? The current approach with conventional commits is not straightforward in a monorepo, as semantic versioning libraries today are not monorepo-aware…

For reference, the structure of the repository is as follows:
```
.
├── BUILD
├── data_lake
│  ├── commons
│  │  ├── commons/
│  │  ├── poetry.lock
│  │  ├── poetry.toml
│  │  ├── pyproject.toml
│  │  ├── README.md
│  │  └── tests/
│  ├── data-asset-1
│  │  ├── BUILD
│  │  ├── data-asset-1/
│  │  ├── lockfile.txt
│  │  ├── pyproject.toml
│  │  ├── README.md
│  │  └── tests/
│  └── data-asset-2
│     ├── BUILD
│     ├── lockfile.txt
│     ├── poetry.lock
│     ├── pyproject.toml
│     ├── README.md
│     ├── data_asset_2/
│     └── tests/
├── dist
│  └── export
│     └── python
├── Justfile
├── lockfile.txt
├── pants.toml
├── pyproject.toml
└── README.md
```
The contents of `pants.toml` are as follows:
```toml
[GLOBAL]
# TODO: try out ruff lint + format
pants_version = "2.20.0.dev7"

backend_packages = [
  "pants.backend.python",
  "pants.backend.python.typecheck.mypy",
  "pants.backend.experimental.python",
  "pants.backend.experimental.python.lint.ruff",
]

[python]
enable_resolves = true
default_resolve = "default"
interpreter_constraints = [">=3.11,<3.12"]

[python-bootstrap]
search_path = ["<ASDF>"]

[python.resolves]
default = "lockfile.txt"
commons = "data_lake/commons/lockfile.txt"
data_asset_1 = "data_lake/data_asset_1/lockfile.txt"
data_asset_2 = "data_lake/data_asset_2/lockfile.txt"

[python-repos]
indexes = ["internal repository"]

[python-infer]
use_rust_parser = true

[pytest]
args = ["-vv"]

[export]
py_resolve_format = "mutable_virtualenv"
resolve = [
  'commons',
  'data_asset_1',
  'data_asset_2',
  ]

[source]
marker_filenames = ["pyproject.toml"]
```
The contents of `data_lake/data_asset_1/data_asset_1/BUILD` are as follows:
```python
poetry_requirements(
    name="poetry",
    source="pyproject.toml",
    resolve="data_asset_1",
)

files(
    name = "data_asset_1_files",
    sources = ["data_asset_1/**", "README.md"],
)

file(
    name="pyproject_toml",
    source="pyproject.toml",
)

python_distribution(
    name = "data_asset_1",
    dependencies = [
        ":poetry",
        ":data_asset_1_files",
        ":pyproject_toml",
    ],
    provides = python_artifact(),
    generate_setup = False,
    repositories = [
        "internal repository"
    ]
)
```
f
Hi @limited-ghost-8212 We have been running a Python / Rust / Docker monorepo for over a year now, and have almost everything in order. The major points are:
• we use Pants in the CI/CD pipeline (and, for expert developers, on their workstations)
• we try to ensure "what happens in the pipeline can be replicated on the workstation" (Pants is the fix for that)
• we have most engineers on Linux and some on Windows, and "deployment" systems are Windows (users) and Linux (servers)
◦ the Windows users are very much like data scientists: they need our library to do stuff
• we "multi version" all "publishable" packages using conventional commits and semver
◦ this is complex, but the principle is:
▪︎ everything that needs to publish with a version adds version_and_changelog() <-- custom to its BUILD
▪︎ behind the scenes this operates via "pants run //tgt/address:bump_version", and then test/package can run etc.
• versioning is handled by conventional commits and convco (a Rust CLI which has support for monorepos)
• the version is "tagged" as
`tags/path.to/thing/that/publishes/vX.Y.Z`
❤️ 2
The monorepo has everything: IaC, pipeline code, libraries, cloud inventory, Docker containers, data science, etc.
❤️ 1
The answers to your questions are my opinion and experience only.

> Do we always have the entire monorepo open in our IDE?

Both: for the pipeline people, all of the repo; for the experienced, all of the repo. For the Python work, I have found it best to open the subfolders, especially with VSCode + Poetry; with the .venv in the folder, it all just works.
❤️ 1
> We want to keep using Python virtual environments and editable installs for the development of a project

Yes, we do this.
> We want to use Poetry

Yes. For Windows, because Pants does not perform well there (WSL) and we have some custom tasks that were written in sh, we use Poetry.
> We want to keep using VS Code as our primary IDE

Yes: 60% VSCode, 40% PyCharm (me: VSCode).
> We also need help in setting up the "commons" library such that it is available as a dependency for all our projects and is part of each project's virtual environment. It should also be included in the bundled package for each project. How can this be achieved?

Pants does this naturally. That is:
• either the user is using it "in repo", so it's not versioned, but just available in a venv, from an install
• or (we have not exercised this need much yet) Pants "publishes" the library, which means that an outsider can just depend upon it.
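To make the "in repo" case concrete, here is a minimal sketch of what the wiring could look like, assuming a `python_sources` target for the commons package; the target name, path, and resolve are assumptions based on your tree, not your actual config:

```python
# data_lake/commons/commons/BUILD (hypothetical sketch)
python_sources(
    name="commons",
    # With enable_resolves = true, a dependency can only point at a target
    # in the same resolve, so this must line up with the consuming projects
    # (or be parametrized across their resolves; see the next sketch).
    resolve="default",
)
```

With that in place, an `import commons.*` in a project's code normally lets Pants' dependency inference add the dependency automatically, and `pants export` / `pants package` will carry the sources along.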
> Currently our dependencies within a project have a high overlap with each other. What tradeoffs do we need to evaluate for per-project dependency management vs. global dependency management on the monorepo level?

We tried to use one Python resolve, but some projects needed a different Python (3.10, 3.11, 3.12), so that failed and we had to have two / three resolves. YMMV; one mechanism worth evaluating is sketched below.
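For the shared-library-across-resolves situation specifically, Pants' `parametrize` stamps out one target per resolve from a single declaration. A hedged sketch (not our actual config) using the resolve names from your pants.toml:

```python
# Hypothetical BUILD for the shared sources: one declaration becomes one
# target per resolve, so each project can depend on commons from within
# its own resolve.
python_sources(
    name="commons",
    resolve=parametrize("commons", "data_asset_1", "data_asset_2"),
)
```

Each of those lockfiles then needs to include commons' third-party requirements; `pants generate-lockfiles` regenerates them.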
> We currently see that we will have to bump project versions manually. Is there an already included feature to auto update versions? And what is the recommendation when wanting to have semantic versioning? Because the current approach with conventional commits is not straightforward in the case of a monorepo, as semantic versioning libraries today are not monorepo aware…

As mentioned, we are using conventional commits and semver. I have a strong opinion that the VERSION itself does not matter, but that having one is crucial: it is a communication and operational mechanism that saves time and enables clear planning. So it is needed, but what it is does not matter; handing that job to conventional commits, and thereby putting versioning in the hands of the developers (via their commits), is awesome.
❤️ 1
l
Thank you @fresh-continent-76371 for investing the time to respond…! 🙏 Reading through now…
Thanks again @fresh-continent-76371…! Really useful insights that help me navigate our tradeoffs better than before. I am curious about two points you mentioned:
> • We "multi version" all "publishable" packages using conventional commits and semver
> ◦ this is complex, but the principle is:
> ▪︎ everything that needs to publish with a version adds version_and_changelog() <-- custom to its BUILD
> ▪︎ behind the scenes this operates via "pants run //tgt/address:bump_version", and then test/package can run etc.
> • versioning is handled by conventional commits and convco (a Rust CLI which has support for monorepos)
• Had a quick read of the convco documentation and couldn't figure out how monorepo versioning is supported. Probably need to look again.
• It would be great if you could elaborate a bit more about your setup here. I am guessing you have some custom scripts to glue things together…
• When you say "add to their BUILD", do you mean updating `python_distribution.provides.python_artifact.version`?
• When you say `version_and_changelog()`, is this something Pants offers?
> Pants does this naturally. That is:
> • either the user is using it "in repo", so it's not versioned, but just available in a venv, from an install
The aim here is to have a shared library package that is published to our internal PyPI. Additionally, when running `pants package data_lake/data_asset_1`, I want the shared library to also get packaged along with the data_asset_1 sources. This is an optimisation I am after to reduce the effort of submitting Python packages as artifacts to an Apache Spark cluster: rather than providing data_asset_1.tar.gz and commons.tar.gz, I want to provide only data_asset_1.tar.gz with the latest commons already part of that tar.
• My struggle currently is getting the Pants configuration right for this. 😞 The kind of wiring I have been experimenting with is sketched below.
• I was not able to make Pants include the shared library in all project venvs that get created by `pants export`, nor in all project packages that get created by `pants package ::`.
• If you have made it work, maybe share how you are achieving it so far?
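To make that concrete, this is roughly the direction I have been experimenting with in the project's BUILD file (a hypothetical sketch that does not yet behave the way I want; the commons target address is an assumption):

```python
# Attempted wiring (not working yet): pull the shared library into the
# distribution alongside the project's own sources.
python_distribution(
    name="data_asset_1",
    dependencies=[
        ":poetry",
        ":data_asset_1_files",
        ":pyproject_toml",
        "data_lake/commons/commons:commons",  # assumed address of the shared sources
    ],
    provides=python_artifact(),
    generate_setup=False,
    repositories=["internal repository"],
)
```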
Happy Monday all! Would be great if some more folks could pitch in and share their thoughts…!
c
For packaging, I'd suggest looking into the `pex_binary` target. That will create a distributable file that includes everything you need in a single file (except the Python interpreter). (It is an executable archive, so you can inspect/unpack it as desired using your normal tools; pex also has tools for creating (installing) venvs on disk from a pex archive.)
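Not your exact targets, but a minimal sketch of what that might look like for data_asset_1; the target name, entry point, and resolve are assumptions:

```python
# Hypothetical addition to the project's BUILD: one self-contained archive
# with the project sources, the commons library, and third-party deps.
pex_binary(
    name="data_asset_1_pex",
    entry_point="data_asset_1/main.py",  # assumed entry module
    resolve="data_asset_1",
)
```

Running `pants package` on that target then drops a single .pex file under dist/.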
🙏 1
f
Note @limited-ghost-8212 that pex binaries don't work on Windows; we have not tracked it deeply, but we hit that problem when we first tried, so for the Windows endpoints we deliver wheels into a mix of conda and Poetry environments. I am unsure if this technical position has changed at all.
👍 1
> Had a quick read of the convco documentation and couldn't figure out how monorepo versioning is supported. Probably need to look again.
To understand how convco is used, first we need to understand the versioning "rules" I applied to the repository. The tagging strategy I added to the repo is the following: every versionable thing is a package, in a directory:
```
./apps/arc/VERSION
./apps/devolver/VERSION
./apps/log_manager/VERSION
./lib/finder_2000/VERSION
./lib/knohh/VERSION
./tools/containers/cicd-general/VERSION
./tools/python/acme_cicd/VERSION
```
Taking a Docker image as an example: it may be made up of 3 wheels and a Rust binary from within the Pants monorepo. These dependencies may be versioned or not versioned; it does not matter. In the Dockerfile we see lines like:
```dockerfile
COPY /*.whl /tmp
RUN pip install /tmp/*.whl && rm /tmp/*.whl
```
- Every version gets a tag (at a point in its lifecycle): merge-to-main, MR/PR, maintenance branches. Borrowing the example above, the tags look like `apps/arc/v4.5.1`, `lib/knohh/v0.2.33`; this is what you see when you run `git tag -l` (amongst other tags).
- The existence of a custom Pants goal (`:bump_version`) tells you what should be versioned (you can't look for VERSION files, because there may not be one yet). My initial method of deciding "what should be versioned" was to look for Pants "packages" and version them all, but that turned out not to be true; it is certainly tidier to mark the package-able thing as "version this please".
- A tag represents the "bundled" package, including all its files and dependencies. In pseudo code, this looks like: if the directory (`apps/arc`) has any changes, then bump the version. Practically, we ask Pants this question, which in turn can include the transitive dependencies (neat!).

Breathe. Now we can look at how convco is operated. convco supports two relevant flags, `--prefix` and `--paths`:
- `--prefix`: `convco version --prefix apps/arc/v ...` tells convco to look for a tag of that form.
- `--paths`: `convco version --prefix apps/arc/v --paths apps/arc` tells convco to hunt the commits for conventional commits, but only those commits that included changes to files in that path (these paths).
> It would be great if you could elaborate a bit more about your setup here. I am guessing you have some custom scripts to glue things together…
Very much so, and my goal would be to publish this setup, via contribution or example, so people can see it working. There was once a statement that said you can't / should not version out of a monorepo (so I took that as a challenge), and so far it has proved not to be true.
> When you say "add to their BUILD", do you mean updating `python_distribution.provides.python_artifact.version`?
Yes, I have a plugin that does that.
> When you say `version_and_changelog()`, is this something Pants offers?
No, this is a macro that adds two targets to the BUILD file: `:bump_version` and `:gen_changelog`.
```python
def version_and_changelog():
    # Adds two targets that run convco (inside a Python script) to determine
    # the next version of the package and to generate its changelog.
    #   - The PWD of these commands is the directory of the BUILD file.
    #   - PANTS_BUILDROOT_OVERRIDE is the root of the workspace.
    #   - ${PWD#$PANTS_BUILDROOT_OVERRIDE/} therefore equals the relative
    #     path to the versionable thing.
    # Example expansion:
    #   python $PANTS_BUILDROOT_OVERRIDE/tools/python/acme_cicd/src/acme_cicd/versioning_ops.py version lib/jlab VERSION
    run_shell_command(
        name="bump_version",
        command="python $PANTS_BUILDROOT_OVERRIDE/tools/python/acme_cicd/src/acme_cicd/versioning_ops.py version ${PWD#$PANTS_BUILDROOT_OVERRIDE/} VERSION",
    )

    run_shell_command(
        name="gen_changelog",
        command="python $PANTS_BUILDROOT_OVERRIDE/tools/python/acme_cicd/src/acme_cicd/versioning_ops.py changelog ${PWD#$PANTS_BUILDROOT_OVERRIDE/} CHANGELOG.md",
    )
```
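For a feel of what a versioning_ops.py-style helper does, here is a minimal hypothetical sketch (not the real script); it only shows the `convco version` invocation with the `--prefix`/`--paths` flags described above, and assumes convco is on PATH:

```python
# Hypothetical versioning_ops.py sketch, not the author's actual code.
import subprocess
import sys
from pathlib import Path

def next_version(package_dir: str) -> str:
    # Ask convco for the next semver, scoped to this package's tag prefix
    # and to conventional commits that touched files under package_dir.
    out = subprocess.run(
        ["convco", "version", "--bump",
         "--prefix", f"{package_dir}/v",
         "--paths", package_dir],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    # e.g. python versioning_ops.py version lib/knohh VERSION
    _cmd, package_dir, out_file = sys.argv[1:4]
    Path(package_dir, out_file).write_text(next_version(package_dir) + "\n")
```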
> either the user is using it "in repo", so it's not versioned, but just available in a venv, from an install

> The aim here is to have a shared library package that is published to our internal PyPI. Additionally, when running `pants package data_lake/data_asset_1`, I want the shared library to also get packaged along with the data_asset_1 sources. This is an optimisation to reduce the effort of submitting Python packages as artifacts to an Apache Spark cluster: rather than providing data_asset_1.tar.gz and commons.tar.gz, I want to provide only data_asset_1.tar.gz with the latest commons already part of that tar.
This works, yes. We don't let developers push to the package repo (as normal; that is danger): all things go through CI. But they can build a wheel, or an image, and it's versioned for them (`6.1.2-dev.cafe67677`) just the same.
> My struggle currently is getting the Pants configuration right for this. 😞
It is not easy, but your query does push me to make a public repo example.
> I was not able to make Pants include the shared library in all project venvs that get created by `pants export`, nor in all project packages that get created by `pants package ::`
This reads to me like: the pipeline publishes and versions the shared library and all downstream libraries together, as separate or the same versions. There are some complications; don't be thinking it is all roses, it is not. But as we have worked through the issues, our team has simply become faster, and we have reduced the support burden because we have provenance, because we version all things :-D
👀 1
❤️ 1