# development
w
What are the reasons for not allowing/supporting inference on sources in `docker_image` targets?
Changing this to just not filter would be nice: https://github.com/pantsbuild/pants/blob/main/src/python/pants/backend/docker/util_rules/docker_build_context.py#L273 I guess there's a blocker in the `codegen_enabled=True` bit, but surely there's a path forward there?
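The filtering in question could be modeled roughly like this; this is a toy sketch of the behavior being discussed, not the actual Pants rule, with made-up target kinds and addresses:

```python
# Toy model of "filter the docker build context to packageable/file-like
# deps" vs. the relaxed "let sources through" behavior asked about.
# The kind labels and addresses here are invented for illustration.

def filter_context_deps(deps, allow_sources=False):
    """`deps` is a list of (address, kind) pairs."""
    allowed = {"pex_binary", "archive", "file", "files", "resource"}
    if allow_sources:
        # The relaxed behavior: pass every dependency into the context.
        return list(deps)
    return [(addr, kind) for addr, kind in deps if kind in allowed]
```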
b
Fundamentally, Pants has assumptions baked in. I think the assumption here was that you dockerize your applications, and those get packaged, so we infer packaged things. The more nuanced answer is that if we allowed sources it'd be difficult to make sure the sandbox has what you'd expect for package dependencies, since we'd expose both the package and all the input sources. I think it's a solvable problem, but it takes consideration and work.
w
> The more nuanced answer is that if we allowed sources it'd be difficult to make sure the sandbox has what you'd expect for package dependencies, since we'd expose the package and all the input sources.

How so? Packages end up in `DistDir`; AFAICT there's no file-level ambiguity when collecting the docker context, right? If a Dockerfile contains a `COPY` that matches an address's filepath, I don't see how that could be misinterpreted. To be fair, you can do what I want by just `file`:ing everything.
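The `file`:ing workaround might look something like this in a BUILD file (target names and paths are invented for illustration; `file`, `files`, and `docker_image` are real Pants target types):

```python
# BUILD -- hypothetical layout. Every loose asset is wrapped in a
# file/files target so docker_image accepts it as a dependency; a bare
# python_source dependency would be filtered out of the context today.
files(name="site_assets", sources=["static/**/*"])
file(name="entrypoint", source="run.sh")

docker_image(
    name="web",
    dependencies=[":site_assets", ":entrypoint"],
)
```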
b
Pants internals don't package into `dist` for docker contexts, though. It's relative to the build root, just like sources. Oh right, and then there's the issue of `file` vs. `resource` šŸ™ƒ
w
> Pants internals don't package into `dist` for docker contexts, though.

So it does, bummer! Although, since packaging has the convention of generating a dotted string as a containing directory, the chances of collisions with sources ought to be slim, and detectable (warn or error), so I still don't see this as a showstopper, particularly?
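The warn-or-error collision check suggested here could be as simple as comparing first path components; a rough sketch with invented paths, not anything Pants ships:

```python
def find_collisions(package_paths, source_paths):
    """Return source paths whose first path component collides with a
    package output directory (the dotted-address convention), so a
    caller could warn or error instead of silently shadowing files."""
    package_roots = {p.split("/", 1)[0] for p in package_paths}
    return [s for s in source_paths if s.split("/", 1)[0] in package_roots]
```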
> Oh right, and then there's the issue of `file` vs. `resource`

I understand that Python wants to differentiate these; I don't know that `docker_image` has to care? Just copy the path if it matches something šŸ¤·šŸ»ā€ā™‚ļø
I'm a bit adamant because, IMO, part of a successful Node.js plugin ought to be supporting deploying web servers, and multi-stage Docker is by far the most common (modern) way to do that, and right now there's no packageable target like pex for Node.js.
The tools that are available are more akin to PyOxidizer, and that is a bad fit for dockerized deployments IMO. Part of the strength of Docker is the vendored base images that contain audited runtimes; shoving a second one in is not great.
Also, Pants adoption surely suffers from `docker build . -f <my-file>` working while `pants package <my-file>` fails with FileNotFound, the solution being to add boilerplate.
b
I think you might've misunderstood the theme of my message. It's completely doable, yes, but it hasn't been done yet. And not in a way that is consistent with the rest of the pants ecosystem. So I agree we should be able to do this. I've been complaining about this flavor of assumptions for half a year now šŸ˜….
w
> It's completely doable, yes, but it hasn't been done yet. And not in a way that is consistent with the rest of the pants ecosystem.

I'm just prodding to rack up ammo for an issue to open for this particular use case, so I can get to it and make it so, because I like the idea. The codebase is malleable enough, I think, and letting `docker_image` depend on any target with a `SourceField` doesn't really change anything fundamental in the API? There's no implicit behaviour added either, really, because of Docker's features.
> So I agree we should be able to do this. I've been complaining about this flavor of assumptions for half a year now šŸ˜….

I did find your design doc for file_deps. I think the paradigm shift is not necessary to make this particular backend (docker) easier to work with.
b
I don't think it's possible without the "edge metadata" paradigm shift, because otherwise we don't have enough info to know how to consume transitive dependencies.
w
I'm assuming the case `COPY path/to/dir` is what you're thinking of? Naive approach: just hydrate the transitive dependencies under the directory?
Transitive dependencies don't really need to play a part in Dockerfiles; you have to name the paths to copy, no? There's no guesswork as to what targets are relevant, I think? šŸ¤”
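The naive approach described above could be sketched as a resolver that matches a COPY path against the files Pants owns; this is a hypothetical helper, not a real Pants API:

```python
def files_for_copy(copy_path, owned_files):
    """Naive resolution of a COPY instruction's source path against the
    files Pants knows about: an exact file match wins, otherwise take
    everything under the path treated as a directory prefix."""
    if copy_path in owned_files:
        return [copy_path]
    prefix = copy_path.rstrip("/") + "/"
    return sorted(f for f in owned_files if f.startswith(prefix))
```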
b
If you have a target with a source field that depends on another target, the user expectation would be to copy both. There is a dependency, after all. So transitive targets are a must. We might've solved the OP case by just looking at direct deps, but then we're still not doing what the user expects in the "normal Pants case" šŸ˜• And just populating the sandbox with transitive deps isn't drop-in. Sandbox creation could take leagues more time due to the duplication of packageables and their deps, and like you point out, existing `COPY` instructions might start getting more than expected šŸ˜•
w
> If you have a target with a source field that depends on another target, the user expectation would be to copy both. There is a dependency, after all. So transitive targets are a must.

Are we talking build context or docker container? The latter I disagree with, and the former is what I think just works?

> We might've solved the OP case by just looking at direct deps, but then we're still not doing what the user expects in the "normal Pants case" šŸ˜•

There is no precedent for what Pants does in the normal case for docker_images; it doesn't support sources at all currently. Conflating pex/python_distribution (search through an entry point) and docker_image `COPY` instructions seems like an anti-pattern to me.
> And just populating the sandbox with transitive deps isn't drop-in. Sandbox creation could take leagues more time due to the duplication of packageables and their deps, and like you point out, existing `COPY` instructions might start getting more than expected šŸ˜•

You're describing misuse, though; I don't think that's a metric to optimize for? Potentially something to warn about, though. And to the `COPY` grabbing too much: sure, that is not covered. What I want:
```dockerfile
# Dockerfile

COPY some.packaged.thing/app.pex
COPY a/file.txt
COPY standalone/python/thing.py
COPY some/dir/with/docs
```
When this gets hairy:
```dockerfile
# Dockerfile

COPY some.packaged.thing/app.pex
COPY some/packaged/thing # <-- this is not great, then, but don't you think this is also kinda unintuitive considering the above line?
```
b
There's currently no difference between an inferred dep (perhaps from a `COPY` line) and an explicit dep. Instead of considering what Pants should do for inferred deps, we have to consider it in the very generic case of "the dependency exists", explicit or implicit. And in that case, if a user has specified a target with a dependency, it's completely reasonable to expect the dependencies to exist in the context.
w
Okay, but wait, why are transitive dependencies problematic again?

> If you have a target with a source field that depends on another target, the user expectation would be to copy both. There is a dependency, after all. So transitive targets are a must.

Can I not find the transitive targets given COPY instructions and explicit deps, hydrate the "transitive_tgt.closure" sources in the sandbox, place the build ctx at the root, and let docker do its thing? What am I missing?
> because otherwise we don't have enough info to know how to consume transitive dependencies.
I think I do?
b
For the moment, forget `COPY` instructions, since inferred deps and explicit deps are the same to Pants: just direct dependencies. Say a docker image target depends on a `file` which itself has a `python_source` dependency. In the build context we should populate the file and its dependencies and their dependencies. Therefore we've just populated the transitive targets of all our direct dependencies, and if we aren't careful that has unexpected consequences.
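The "populate the file and its dependencies and their dependencies" behavior is just a transitive closure over the dependency graph; here is a minimal sketch, using a plain dict as a stand-in for Pants' target graph:

```python
def transitive_closure(graph, roots):
    """Walk a {target: [direct deps]} mapping from `roots`, collecting
    every reachable target -- the "deps of deps" population behavior."""
    seen, stack = set(), list(roots)
    while stack:
        tgt = stack.pop()
        if tgt in seen:
            continue
        seen.add(tgt)
        stack.extend(graph.get(tgt, []))
    return seen
```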
w
> For the moment, forget `COPY` instructions, since inferred deps and explicit deps are the same to Pants: just direct dependencies.

Sure, I'm just making sure I'm not lost on the way. The goal should be to construct a correct build ctx given the docker instructions, after all. Okay, I think I need elaboration on

> and if we aren't careful that has unexpected consequences

for this to fall into place for me. I agree with what you said until this point. I don't see how that is any different from assembling a pex, except that I let it use `SourceField` instead of `PythonSourceField`? Thank you for humouring me, BTW šŸ™
b
I'm hoping this thread is recruitment for the larger effort anyway, so I don't mind šŸ˜›

> I don't see how that is any different from assembling a pex, except that I let it use `SourceField` instead of `PythonSourceField`?

Sandbox creation is exactly 1:1 with "building a PEX", or any other process for that matter. The difference is that a PEX gobbles up everything in the sandbox (-ish, but let's simplify). In Docker we can't really assume anything, because the user can `COPY` wherever/whatever. Consider a Dockerfile with an explicit dependency on a `pex_binary` whose instructions are just along the lines of `COPY buildroot/`.
w
> Consider a Dockerfile with an explicit dependency on a `pex_binary` whose instructions are just along the lines of `COPY buildroot/`

I think most users would find success by `COPY buildroot/` trying to find the targets in `buildroot`, iterating the dependencies, and hydrating the full thing, so that their build ctx doesn't error out on them. The issue you're describing is real, agreed, but it isn't difficult for the user who knows enough pants-foo to add explicit targets in BUILD files to also add an ignore pattern for everything else that might be inferred by the `COPY buildroot/`, if it is a performance issue? My suspicion is that the current `BuiltPackage` inference usage is more prevalent, but I have no data on this other than what my org does.
b
Well, I wasn't talking about inference in my COPY instruction. Remember, there can be explicit dependencies too. All this to say: we don't know what the user intends to do in their Dockerfile, so doing "the obvious thing" from a list-of-dependencies standpoint is the only logical choice. But today we don't have enough metadata to do "the obvious thing".
w
You lost me again, then. What was the point of `COPY buildroot/` in your example; what are you illustrating?
b
We can't infer what is expected to be in the build context from the Dockerfile, both in a literal sense (dependency inference) and in a metaphorical sense. We have to populate the transitive dependencies into the context, because otherwise we'd be making assumptions that could be wrong.
w
Right, okay, and why can I not just grab the `AddressFamily` (or w/e it's called) at `/build_root` and just materialize/hydrate all the things?
b
I'm not talking about build inference, for one...
But for two, that's assuming the user wants everything under a directory and not some subset.
Also, I think you're missing the forest for the trees here. Plugins and codegen can populate files anywhere in the sandbox, so inference alone doesn't solve the "what belongs in the sandbox" question. That's why I'm pushing so hard to focus on the list of dependencies and not inference.
w
> Plugins and codegen can populate files anywhere in the sandbox, so inference alone doesn't solve the "what belongs in the sandbox" question.

True, but they could be explicit dependencies, though, right?
b
Exactly! And there is no difference between explicit dependencies and inferred dependencies by the time Pants looks at the dependencies field.
So it's best to just assume the deps field was explicitly populated, especially considering that's the richest kind of metadata: the user explicitly told us something.
w
I'm still not sure that this isn't doable with current Pants deps, though... Sure, not all dependencies can be calculated at inference time, because files/directories might not exist / are not owned when dependencies are inferred. But from what I gathered, the problem isn't with inference but rather that a user cannot state the dependencies of a docker image? Or rather, Pants assumes what kinds of sources it will use, at every level. But that assumption isn't holy; IMO it seems arbitrary in `docker_image`'s case šŸ˜… Does that really equate to "we can't know what the users want"? I'm skeptical.
> But for two, that's assuming the user wants everything under a directory and not some subset

That is mapped to a target, yes, I think so? That's already better than what `docker build .` does, IMO, because that exposes everything... There's also `.dockerignore` to guide this op.
b
I really don't think that's an assumption we can make for the user, though. Especially since existing Dockerfiles with Pants already make assumptions about, well, not this šŸ˜…
w
> I really don't think that's an assumption we can make for the user, though. Especially since existing Dockerfiles with Pants already make assumptions about, well, not this šŸ˜…

An option and/or feature flag can flip that switch; I don't think that's an issue in general?

> I really don't think that's an assumption we can make for the user, though.

The current behaviour is also a decision, though. Authoring a `docker_image` target is the highest point of friction I've encountered with devs in my org, and there are Slack threads in #general asking what is going on.
Maybe not the greatest argument for the change, but it is there šŸ˜„
b
So there's two options:
ā€¢ Take the dependencies we're given and ignore their dependencies
ā€¢ Use the transitive set
1 solves your case, but still makes for a bad experience, since we're ignoring metadata. 2 solves the case in general, but requires a smarter Pants. That's about what this boils down to.
šŸ‘ 1