# plugins
f
So I'm finally getting down to business and looking at a way to integrate Pants with an RPM-based system. I see several challenges to getting this to work, and I want to start laying them out here so I can understand the problem space before I start hacking away:
1. How can I tell Pants how Python modules can be provided by these system packages, as opposed to pip-installable packages?
2. How will I be able to run build steps with these (and only these) system packages installed? Containers? Chroot?
3. If I use one of these solutions for build steps, I imagine I'll need to run multiple commands to set up and clear the build environment. Is this compatible with the way Pants executes rules now?
f
FYI there is prior art in Pants v1. I wrote an RPM plugin for Pants while at Foursquare: https://github.com/foursquare/fsqio/tree/master/src/python/fsqio/pants/rpmbuild
🙏🏻 1
It’s been long enough that I don’t remember much about it, and obviously v2 != v1, but some of how that plugin handled RPM builds might point at answers for your questions.
the plugin used a docker container to run the RPM builds using a standard template: https://github.com/foursquare/fsqio/blob/master/src/python/fsqio/pants/rpmbuild/tasks/dockerfile_template.mustache
p
I think someone else was looking at doing a plugin for deb files. Implementation-wise, I think it would be good if we can leverage `fpm` so that you can at least build the RPMs on any platform, not just on RPM platforms. `fpm` can be leveraged for both deb and rpm. So, more shared stuff and less to maintain for each system package type we add.
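As a rough illustration of that sharing, here's what driving `fpm` for both formats might look like (a minimal sketch; the name, version, and paths are placeholders):

```python
import subprocess

def build_system_package(pkg_type: str, name: str, version: str, staged_dir: str) -> None:
    # `fpm -s dir -t {rpm,deb}` reads a staged directory tree and emits the
    # package; the `src=dst` mapping installs the staged contents under /opt/<name>.
    subprocess.run(
        ["fpm", "-s", "dir", "-t", pkg_type, "-n", name, "-v", version,
         f"{staged_dir}/=/opt/{name}/"],
        check=True,
    )

# Same inputs, two output formats.
for pkg_type in ("rpm", "deb"):
    build_system_package(pkg_type, "myapp", "1.0.0", "./dist")
```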
f
I think this could only be done at a really, really high level, or perhaps for a restricted set of use cases. There's quite a bit of difference between deb and rpm fundamentally, and then figuring out two dependency chains seems complicated as well
I mean, if your goal is simply to package apps into OS-level packages, I think FPM could work for that use case. But I'm looking to leverage the RPM ecosystem at every step in the build process (build, test, run, package) and to manage dependencies in terms of RPM repo streams. I don't think it's the same use case. I could be wrong though. What would you like to see in something like this?
p
Yes. A lot of the target/rule logic will differ, but they can share an FpmSubsystem.
Eventually, I want to build RPMs and debs. But, I want to package up a virtualenv (not a pex) in the system package. So, I guess I’m bypassing a lot of the issues integrating with system python.
For question number 2: I wonder how much control the remote execution API provides for system package management, or if there are any guarantees about which system is used to run your build. Locally, Pants can run on macOS, and so it might need Docker Desktop or other additional infra to be able to run such build steps. But remotely? I wonder about REAPI and how it will affect the design here.
f
remote execution currently requires the platform of the host and the remote executor to match. this is a known limitation in the Pants REAPI client (I’ve made a lot of the recent changes to it, so feel free to ask me about it)
p
I haven’t actually used the remote stuff yet. So, if I’m running pants locally on a darwin system, the remote executor has to be darwin as well? Not linux?
f
yes, although it’s unclear if that has ever been tried in practice, since I don’t believe any of the existing REAPI servers work on macOS, just Linux
p
oh. That’s going to be an interesting rabbit hole once I get there… But I think I’ve digressed from the OP topic.
f
it’s annoying since when I test the Pants REAPI client on my macOS laptop I have to start a Linux VM and run both the server and Pants client in the VM
🤦 1
p
Do the distros have to match as well? EL vs debian based?
f
They shouldn’t, at least for Python, since pex does its interpreter selection while running in the remote execution environment.
f
Yeah, the whole issue with system packaging, afaict, is that there's no way to make a system appear to have an arbitrary set of system packages without involving some kind of chroot mechanism (with containers being a much more complete chroot mechanism)
f
but once you build a pex with a native C module and try to use it on the other platform, boom
đź’Ą 1
we do mix the platform into the cache key, so that’s not an issue, but a pex would sometimes be “platform: none” and technically usable on both platforms
although I’m blanking on the particular form of how the platform incompatibilities showed up, so don’t take my word for it on the “native C module” reference. could have been something else.
p
Well, I am very interested in this topic… I really want to make a mess of Ruby go 💥. That mess of Ruby is used to marshal building debs and rpms in containers. It’s an awful mess.
f
but easy enough to replicate the incompatibility, merely by spinning up one of the servers in a docker container and pointing Pants at it.
my v1 RPM plugin ran `docker` directly. if the remote execution environment could run `docker`, then that approach probably could still work (since the Pants v2 rule would just be asking for a `Process` to be run)
but if the remote executor itself is running in a docker container, now you have docker-in-docker which is … fun
f
DinD can be avoided 99% of the time just by re-using the host's docker daemon: mount the docker binary and daemon socket into the container
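A hedged sketch of that socket-mount pattern (image names and paths are placeholders, and mounting the client binary assumes it finds compatible libraries inside the container):

```python
import subprocess

# The inner `docker build` talks to the host daemon via the mounted socket,
# so no nested daemon is needed.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        "-v", "/usr/bin/docker:/usr/bin/docker:ro",
        "builder-image:latest",
        "docker", "build", "-t", "result:latest", "/workspace",
    ],
    check=True,
)
```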
Reading this (rpmbuild_task.py) now @fast-nail-55400. I don't really know how caching worked in v1. Is the workunit thing cache-related? Or were you just relying on the local docker daemon to cache builds itself?
đź‘€ 1
f
yeah looks like it relied on just orchestrating the docker invocation, so if you were iterating on a package, then it would be docker caching that would speed builds up
that would be the case even in v2 if all the rule did was write a Dockerfile and invoke docker
those layers wouldn’t end up in the REAPI CAS
(similar issue in CI systems with trying to cache Docker layers built by a CI job)
I wonder how much of the Pants v2 docker support would be usable for the effort
> those layers wouldn’t end up in the REAPI CAS
at least not without a `docker export` or some other equivalent (skopeo is actually a pretty good tool for stuff like that). by “REAPI CAS” I also mean Pants’ local cache
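For illustration, two hedged examples of getting image bits onto disk where a build action could then capture them (container and image names are placeholders):

```python
import subprocess

# `docker export` dumps a container's filesystem as a tar.
subprocess.run(["docker", "export", "my-container", "-o", "rootfs.tar"], check=True)

# skopeo can copy an image out of the local daemon into an on-disk OCI layout,
# without running a container at all.
subprocess.run(
    ["skopeo", "copy", "docker-daemon:my-image:latest", "oci:./image-dir:latest"],
    check=True,
)
```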
f
Yeah what I think would be interesting here is being able to export docker layers and metadata in a way the CAS can deal with
Kinda interested in buildah for this, in that it's kinda a throwback to a command-and-commit style of building container images
f
when I was researching a remote execution product for Toolchain, `umoci` (https://github.com/opencontainers/umoci) was very useful for image manipulation.
(and https://github.com/containers/skopeo was useful for image downloads and conversion). both deal with the image directly without involving having to spin up a docker container.
f
thanks, being able to make this work without an actual docker daemon is a requirement for me
f
looking at the Pants v2 docker subsystem, it looks like it still invokes docker
f
yeah that's not the wave for my use case (and I'll need to describe this use case better)
I'm actually less interested in building containers for consumption as artifacts than I am in using containers to create hermetic environments for build steps that depend on system-level packaging constructs
f
including in remote execution?
f
eventually yeah
as it stands, iiuc, pants does hermetic runs by creating a temp process execution dir, copying in whatever files need to be there, running the command, then copying out the result files it needs to pick up, and caching those
f
correct, at least locally. for remote execution, that will depend on what the particular server in use does and what can be configured.
but similar concept, temp directory that is wiped away once the outputs are captured to the CAS
f
well, I'll focus on local for a sec, for the sake of discussion
Let's say instead of a tmp dir, I used a new container as the basis of my execution environment; and when the process(es) completed, I could either commit that writable layer to a new image that I could export, or I could copy specific files out of it. Either way, I have some immutable result that could go into CAS
Does that make sense?
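A minimal sketch of that run-then-capture flow using the plain docker CLI, just to make the idea concrete (image names and paths are placeholders):

```python
import subprocess

def run_build_step(base_image: str, argv: list, result_image: str) -> None:
    # Run the build step to completion in a fresh container.
    cid = subprocess.run(
        ["docker", "run", "-d", base_image, *argv],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["docker", "wait", cid], check=True)
    # Option A: commit the writable layer as a new image and save it as a tar,
    # an immutable artifact that could go into the CAS.
    subprocess.run(["docker", "commit", cid, result_image], check=True)
    subprocess.run(["docker", "save", result_image, "-o", "image.tar"], check=True)
    # Option B (alternative): copy specific result files out instead.
    # subprocess.run(["docker", "cp", f"{cid}:/build/out", "./out"], check=True)
    subprocess.run(["docker", "rm", cid], check=True)
```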
f
yes. note though, at least in the Pants/REAPI execution model, you would still have a tmpdir for the build action, but it would run a bash script with the tool invocations needed to invoke the container and then ensure the (let’s say OCI format) image is available in the tmpdir for capture to the CAS
since CAS capture requires the output to be in the action’s tmpdir
(and future invocations can then load that output in the next action’s input root by referencing its digest)
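To make that concrete, a rough sketch of such a rule (not a working plugin: `BuildRpmImageRequest` and the script contents are hypothetical; the engine APIs shown are Pants 2.x):

```python
from dataclasses import dataclass

from pants.engine.fs import CreateDigest, Digest, FileContent
from pants.engine.process import Process, ProcessResult
from pants.engine.rules import Get, rule

# Stand-in for whatever tool invocations actually build the image.
BUILD_SCRIPT = b"""#!/bin/bash
set -euo pipefail
# ... invoke the container build here ...
# leave the result as an OCI layout under ./image/ so it gets captured
"""

@dataclass(frozen=True)
class BuildRpmImageRequest:  # hypothetical request type
    description: str

@rule
async def build_rpm_image(request: BuildRpmImageRequest) -> ProcessResult:
    # Materialize the script into the action's input root (the tmpdir).
    script_digest = await Get(
        Digest,
        CreateDigest([FileContent("build.sh", BUILD_SCRIPT, is_executable=True)]),
    )
    # Anything under image/ in the tmpdir is captured to the CAS afterwards.
    return await Get(
        ProcessResult,
        Process(
            argv=("./build.sh",),
            input_digest=script_digest,
            output_directories=("image",),
            description=request.description,
        ),
    )
```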
although that execution model is not that performant currently if the input root’s size is really large due to cost of writing to disk
we hit that with the Go plugin when we originally supported mounting the Go SDK into the input root
and container images conceivably will be really large
(although the “append-only” cache feature of the Pants execution model could help there, i.e. “named caches”)
f
there's probably a lot of things I could do for performance if it comes to that...
f
but I agree with your idea of manipulating the image as its own entity
it’s why I ended up using `umoci`, for example, for the Toolchain research. we could unpack an image to disk, modify it in unpacked form as just a filesystem layout (in OCI format), and repack it without any docker invocation.
👍🏻 1
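For reference, a hedged sketch of that unpack/modify/repack flow (paths and tags are placeholders):

```python
import subprocess

def edit_image(oci_dir: str, tag: str, bundle_dir: str) -> None:
    # Unpack the OCI image into a rootfs on disk (add --rootless if unprivileged).
    subprocess.run(["umoci", "unpack", "--image", f"{oci_dir}:{tag}", bundle_dir], check=True)
    # ... modify f"{bundle_dir}/rootfs" directly as plain files, e.g. drop in RPMs ...
    # Repack the modified rootfs as a new layer on the same image/tag.
    subprocess.run(["umoci", "repack", "--image", f"{oci_dir}:{tag}", bundle_dir], check=True)
```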
f
noted
so this at least answers question #3: dump a script into the process execution dir that invokes the steps you need and makes sure the output is there to be captured
and probably question #2 will be answered by playing with container tools to find something that has reasonable performance for this purpose
but for #1... I guess I should provide an example of what I mean...
f
re #1, would a concept similar to maven’s jar “scope” be relevant?
i.e., “provided” scope
f
So what if I want to tell Pants that the python `requests` module comes not from PyPI but from the Fedora `python3-requests` package? And how could I hook that notion into the dependency inference system?
I'm not that familiar with maven, but I'll google that
f
(for maven, a jar with “provided” scope is included in the Java classpath for compiling but is excluded from the “runtime” classpath as already provided by the deployment environment)
f
hmm, that's not quite the same thing, although that might be a concept to think about in this whole design process
f
for #1, can you clarify what code is trying to use `requests` and how is it being packaged into the RPM?
for example, you should be able to build a pex and embed the pex into the RPM without having to solve
> tell Pants that the python requests module comes not from PyPI but from the fedora python3-requests package
which implies that there is non-pex-packaged Python code in the RPM
f
I'm not looking to build pexes at all
f
okay but then re the question, what target would need that dependency inference?
f
My ultimate build artifacts will likely be sets of RPMs. Some of the inputs to building those will need to be knowledge of their dependencies
f
probably depends on the type of “source” being fed to RPM then?
f
So if I want to build an RPM for a python module that just repeatedly uses `requests` to ping https://httpstat.us/200, I'll need the python file that does that, plus the metadata that the RPM depends on `python3-requests`.
f
which would have been encoded in the .spec file. are you proposing that Pants write the RPM spec, instead of the spec still being written by the developer?
(for my Pants v1 plugin, the spec remained hand-written, and rpm’s own dependency scan still applied for generating dependencies)
f
Yes, the spec would be a template that would get filled in by build metadata determined by this Pants plugin
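A minimal sketch of that template-filling step (the template fields and helper names are invented for illustration):

```python
from string import Template

# Invented fields; a real template would carry the full spec boilerplate.
SPEC_TEMPLATE = Template("""\
Name:           $name
Version:        $version
Release:        1%{?dist}
Summary:        $summary
$requires

%description
$summary
""")

def render_spec(name: str, version: str, summary: str, rpm_requires: list) -> str:
    # One "Requires:" line per RPM dependency the plugin computed.
    requires = "\n".join(f"Requires:       {r}" for r in rpm_requires)
    return SPEC_TEMPLATE.substitute(
        name=name, version=version, summary=summary, requires=requires
    )
```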
f
maintain a static mapping of PyPI module name to RPM package name?
f
like a separate module_mapping?
f
yeah
or have a default rule and use a module_mapping table as an override
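A hedged sketch of that mapping-with-default-rule idea (the table entries and the fallback heuristic are illustrative, not authoritative):

```python
from typing import Optional

# Illustrative defaults; a real table would be maintained alongside the plugin.
DEFAULT_MODULE_MAPPING = {
    "requests": "python3-requests",
    "PyYAML": "python3-pyyaml",
}

def rpm_for_pypi(project_name: str, overrides: Optional[dict] = None) -> str:
    if overrides and project_name in overrides:
        return overrides[project_name]
    if project_name in DEFAULT_MODULE_MAPPING:
        return DEFAULT_MODULE_MAPPING[project_name]
    # Default rule (a guess, not authoritative): Fedora generally names
    # Python 3 library packages "python3-<lowercased project name>".
    return f"python3-{project_name.lower().replace('_', '-')}"
```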
f
Makes sense
f
also this really isn’t pants dep inference, since you aren’t injecting a dep on another target. This is just this plugin filling in the RPM spec.
(putting aside deps between RPM packages managed by Pants)
f
but the plugin would need to use Pants’ inferred deps as input to this mapping
I guess I just need the rule that captures that output...
like, is there a rule output type that captures the notion “these are the set of imports found in the first-party code targeted by this run”?
f
Yes. Use `DependenciesRequest` or `TransitiveTargetsRequest` and filter down to python requirement targets.
🙏🏻 1
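A sketch of that suggestion (Pants 2.x engine APIs; exact imports and fields vary by version, and this helper would need to be awaited from within a `@rule`):

```python
from pants.backend.python.target_types import PythonRequirementsField
from pants.engine.addresses import Address
from pants.engine.rules import Get
from pants.engine.target import Target, TransitiveTargets, TransitiveTargetsRequest

async def python_requirement_targets(address: Address) -> list:
    # Walk the whole transitive dependency closure of the given target...
    transitive = await Get(TransitiveTargets, TransitiveTargetsRequest([address]))
    # ...and keep only third-party requirement targets; these are the ones
    # the plugin would map to RPM "Requires:" entries.
    return [tgt for tgt in transitive.closure if tgt.has_field(PythonRequirementsField)]
```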
f
awesome... this is coming into shape
Thanks for helping, and for letting me bounce these ideas off you. I expect this to be a really ambitious but worthwhile undertaking. And I'm a lot more confident that, at my current company, this could become something we could open source if it works out.
👍 1
Maybe the RPM stuff isn't the most portable for everyone, but the ability to run build steps in containers and then use the images as output could be a game-changer for interacting with system packaging and other native code