# general
f
Can anyone point me at the code that makes pants go "oh, hey, I can use protobuf to generate that module. let's run it!"?
Oh, is it `FirstPartyPythonMappingImpl`? How does one extend that for non-Python code?
h
Not quite, the mapping just supports dependency inference.
The protobuf codegen backends are under `src/python/pants/backend/codegen/protobuf`, mostly in the various `rules.py` files.
And you can see which code is shared and which is separate across the Python, Java, and Go implementations.
f
I spent some time reading exactly that code before asking, but couldn't find the magic. I could not figure out how to add a dependency on anything other than a Target listed in a BUILD file.
h
That gives a decent overview of how the pieces fit together
Basically, the glue is:
class GeneratePythonFromProtobufRequest(GenerateSourcesRequest):
    input = ProtobufSourceField
    output = PythonSourceField
Which says “I know how to generate Python sources from Protobuf sources, so invoke me if you see a dep from something that knows how to consume Python sources onto protobuf sources”
Then, for convenience, it’s best if that dep is inferred. But you can start out with an explicit dep just to see if the other moving parts do what you expect
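(For reference, roughly how the rest of that wiring looks. This is a sketch only: the rule body is elided, and the rule name is made up, but `GeneratedSources`, `GenerateSourcesRequest`, and the `UnionRule` registration are the standard codegen plugin API:)

from pants.engine.rules import collect_rules, rule
from pants.engine.target import GeneratedSources, GenerateSourcesRequest
from pants.engine.unions import UnionRule


@rule
async def generate_python_from_protobuf(
    request: GeneratePythonFromProtobufRequest,
) -> GeneratedSources:
    # Run protoc (or your generator) as a Process over
    # request.protocol_sources, capture the output digest,
    # and return it wrapped in a GeneratedSources.
    ...


def rules():
    return [
        *collect_rules(),
        # This registration is the "invoke me if you see a dep from a
        # Python consumer onto protobuf sources" part.
        UnionRule(GenerateSourcesRequest, GeneratePythonFromProtobufRequest),
    ]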
f
Okay, yes. I've got it working so that if I run `export-codegen` I get the files generated. Getting concrete: I'm successfully generating `foo.h`. I have a source file that #includes foo.h, but I can't find a way to add the generated foo.h into the inferred dependencies. It worked fine when I had it as a source file, but after switching to generated, it's not in AllTargets.
Rereading what you wrote... so when my inference logic finds the `#include foo.h` and determines that there isn't a foo.h target, it should look for possible upstream sources that could turn into a foo.h and infer the dependency on those?
Meaning my inference engine needs to understand my generation DAG in reverse?
Ah! I suspect my issue has to do with this comment in pants.engine.target:
For generated first-party addresses, use
`./` for the file path, e.g. `./main.py:tgt`; for all other generated targets,
use `:tgt#generated_name`.
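(So, e.g., a hypothetical BUILD entry using that second syntax, assuming a generator target named `my_generator` that generates a target named `foo.h`; `my_sources` is a made-up target type:)

my_sources(
    name="lib",
    # Explicit dep on a generated (non-file-backed) target, by generated name:
    dependencies=[":my_generator#foo.h"],
)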
oh, happy dance! I got the header generation working using that syntax.
But now the newly generated header isn't having its dependencies inferred. :(
h
Its dependencies on, e.g., non-generated code?
For context, sounds like you’re working on a custom C/C++ plugin of some kind?
Re your question above: Yes, exactly: your inference logic finds `#include foo.h`, determines that `foo.h` isn't provided by checked-in source but can be provided by the code generator, and adds a dep on `foo.proto` (or, more precisely, the relevant target containing it as a source).
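A minimal sketch of what that inference hook might look like (the field classes and the reverse-mapping step here are hypothetical; the `InferDependenciesRequest` machinery is real, though its exact shape varies across Pants versions):

from pants.engine.rules import collect_rules, rule
from pants.engine.target import (
    FieldSet,
    InferDependenciesRequest,
    InferredDependencies,
    SingleSourceField,
)
from pants.engine.unions import UnionRule


class MySourceField(SingleSourceField):
    pass


class MyInferenceFieldSet(FieldSet):
    required_fields = (MySourceField,)
    source: MySourceField


class InferMyDepsRequest(InferDependenciesRequest):
    infer_from = MyInferenceFieldSet


@rule
async def infer_my_deps(request: InferMyDepsRequest) -> InferredDependencies:
    # 1. Hydrate request.field_set.source and scan it for `#include "foo.h"`.
    # 2. If foo.h is checked-in source, dep on its owning target as usual.
    # 3. Otherwise, consult a reverse map from generatable file names to
    #    generator targets (foo.h -> the target owning foo.proto) and dep
    #    on that instead.
    return InferredDependencies([])  # placeholder


def rules():
    return [*collect_rules(), UnionRule(InferDependenciesRequest, InferMyDepsRequest)]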
f
Yes, something like that. It's not C, but it does use the C preprocessor.
And my generated header itself has a #include inside of it, but that included header (a primary source file, in this case) isn't showing up in the transitive dependencies, probably because foo.h didn't exist yet when they were computed.
h
Ah hmm, yes, that is tricky. You'd have to write dep inference logic that can look at a `.proto` and figure out which deps it generates.
To start with, just so you can make progress, does it work if you manually add the dep to the `dependencies=` field of the `protobuf_sources` target?
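(I.e., something like this hypothetical snippet, with made-up paths:)

protobuf_sources(
    name="protos",
    # Manually declare what the generated code will need, since
    # inference can't see into it yet:
    dependencies=["src/headers:base"],
)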
f
Yes. That works
Could I do something like get the transitive targets, hydrate the sources, and then iterate until the list stabilizes? Hopefully with caching it wouldn't be too painful?
Or I guess a proper graph traversal would make more sense, but same basic idea... generate more dependencies and then rescan for new ones.
Or does inference happen so early in Target construction time that there's not really a good way to repeat it?
h
Yes, IIRC it happens early, and it would take some non-trivial changes for the build graph to be constructed in an iterative way. Although that would be ideal.
I have a feeling that we don’t do anything similar today with python/jvm/go, because the deps of the generated code are conceptually known before generation happens. They are either A) upstream generated code, or B) proto API code.
Sounds like you have a third case
f
Yes. Without going into too much detail, we currently have a Makepp-based system that allows us to dynamically discover both new dependencies and new targets while the build is in flight and just splice them into the build graph. And since that capability is there, users have taken advantage of it in ways we never anticipated. Not ideal, but it's entrenched.
I'll have to think about it some more, but I think we can probably deal with only inferring dependencies from primary sources. The bigger problem is going to be the dynamic addition of "targets" (to use the make term) to the graph.
For example, we have one tool flow that builds up a database from lots of sources - more or less similar to the C data flow of source file -> object file -> library. But then there's a new set of header files generated from the database by another tool, and the names of those files depend on the names of the objects defined in the source code (which generally don't match the source file names).
I'm not even sure how I would express a dependency on a 2nd generation derived file in the Address syntax.
foo.c, bar.c -> foo.o, bar.o -> libfoobar.so
main.c -> main.o; main.o, libfoobar.so -> mainfoobar.exe
libfoobar.so depends on foo.c#foo.o, but main.c doesn't depend on anything, so how does libfoobar.so get pulled in for linking mainfoobar.exe?
(again, my flow isn't compiling C code, but it illustrates the example reasonably well. note: this isn't the database driven flow I mentioned above, that's even more complex so I want to understand this easier one first)
Another interesting dynamic target case we have... a tool will generate any file you ask for with the file name pattern aaa_x_y_z where the contents of the file are determined by the values of x, y, and z. And some extra "accessory" files get generated too, the names of which are effectively random, but deterministic. And the (x, y, z) tuple space is so large there's no way to predict what values will be needed prior to inferring a dependency on the aaa_x_y_z file.
h
In answer specifically to “libfoobar.so depends on foo.c#foo.o, but main.c doesn’t depend on anything, so how does libfoobar.so get pulled in for linking mainfoobar.exe?“: You’d have a `cc_binary()` target (or similar; I’m pretending that your code is C/C++ for simplicity) that has explicit `dependencies` on its entry point, and then dep inference can take over.
In your case, how can `main.c` not depend on anything? I would have assumed it must `#include` and invoke the code in the lib?
But if not, then the lib would need to be an explicit dep as well
f
Sure, but it's just a header file inclusion; building main.o doesn't need the .so object.
h
But zooming out to the general problem, your use case is really interesting, and kind of the opposite of the JVM use case. In JVM you need to compile the deps of foo.java before you can compile foo.java itself, because the classfiles of those deps must be on the compiler’s classpath when foo.java is compiled. This means that deps have to be inferred entirely from sources, as they must all be recursively known before any compiling can happen. But in your case it sounds like you have a preprocessor phase that is used for dynamic dep discovery. So you preprocess in reverse order, and then compile everything concurrently (since the preprocessor gives you independent translation units)?
f
Yes, good summary
h
Sure, each `.o` builds entirely independently after the preprocessor runs, since the result of the preprocessor is a single-file translation unit that can be compiled with no further inputs. You only bring everything together at link time.
C/C++ is radically different from JVM in this regard.
(and I prefer the preprocessor model to the JVM model, but that’s just my 2 cents)
So, this is definitely possible, but you may have to ignore some existing Pants mechanisms, and go one level lower in the APIs
Is this something you can put in a public repo for us to look at?
f
probably not the real stuff, but I imagine I can construct a synthetic C example that shows the case. Just a simple Makefile to use to demonstrate the flow?
h
So you need rules to implement something like “start from some given ‘root’ files, run the preprocessor on each, examine the output of that preprocessor, infer deps from it, and recurse on that process until you’ve built up a transitive closure of preprocessed translation units”
Then you compile all those translation units entirely independently (I’m assuming a C-like model here)
Then you link them all together
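In rule form, that fixpoint might look something like this (all the dataclasses and the assumed `PreprocessRequest -> PreprocessedUnit` rule are made up; `Get`, `MultiGet`, and `@rule` are the real engine APIs):

from dataclasses import dataclass

from pants.engine.rules import Get, MultiGet, rule


@dataclass(frozen=True)
class PreprocessRequest:
    path: str


@dataclass(frozen=True)
class PreprocessedUnit:
    path: str
    discovered_includes: tuple[str, ...]  # deps found in the preprocessed output


@dataclass(frozen=True)
class ClosureRequest:
    roots: tuple[str, ...]


@dataclass(frozen=True)
class TranslationClosure:
    units: tuple[PreprocessedUnit, ...]


@rule
async def build_translation_closure(request: ClosureRequest) -> TranslationClosure:
    seen: set[str] = set()
    frontier = list(request.roots)
    units: list[PreprocessedUnit] = []
    while frontier:
        # Preprocess the current wave concurrently; each Get is cached.
        # (Assumes a rule elsewhere that maps PreprocessRequest -> PreprocessedUnit
        # by running the preprocessor as a Process.)
        wave = [p for p in frontier if p not in seen]
        seen.update(wave)
        results = await MultiGet(
            Get(PreprocessedUnit, PreprocessRequest(p)) for p in wave
        )
        units.extend(results)
        # Recurse on anything newly discovered until we reach a fixpoint.
        frontier = [inc for r in results for inc in r.discovered_includes]
    return TranslationClosure(tuple(units))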
Does that sound right?
f
Yes, that's right
h
Ah, but protobuf complicates things
How does that fit in?
f
note that it's not actually protobuf, we don't have any python code in this particular flow
I was just using it as an example of a preprocessor
h
oh
In that case, I think you can achieve that quite straightforwardly by ignoring most of the existing Pants machinery relating to dep inference and targets and all that. That was designed with different use cases in mind, and is very heavyweight. I would suggest writing this more or less from first principles. You do need some input targets, but I’m not sure you need to model all the intermediate stuff as targets. Just create your own ad-hoc dataclasses, and keep it all as lightweight as possible. If you can do this in the open we can advise.
f
Would I still be able to hook into the graph traversal system by doing that?
h
What would be the need to do so?
But yes, you can still get Targets from Addresses as needed
f
I'd still want to be able to call a goal and have Pants build up the graph from primary sources through the conversion steps to the goal.
And especially take advantage of the caching and remote execution features.
h
So caching and remote execution are baked into the engine at a very low level. Any time you invoke a `Process`, they are applied to it.
You aren’t missing out on that by working at a lower level
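(e.g. any ad-hoc invocation along these lines gets caching and remoting for free; the argv, paths, and digest here are hypothetical, and this has to run inside an `@rule` body:)

from pants.engine.process import Process, ProcessResult
from pants.engine.rules import Get

result = await Get(
    ProcessResult,
    Process(
        argv=["/usr/bin/cpp", "-E", "src/foo.c"],  # hypothetical invocation
        input_digest=sources_digest,  # a Digest of the input files
        output_files=("src/foo.i",),
        description="Preprocess src/foo.c",
    ),
)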
And you can still hook into a goal
It’s more about how to model all the intermediate stuff
So for example, you may not need to generate `Target` instances for every source file your preprocessor creates. You could just track files directly.
That’s a bit handwavy, of course the devil is in the details. That’s why I’m wondering if you can work on a redacted version of this in the open.
f
At first glance, it looks to me like it would be easiest to just create a new IntermediateTarget class that has a GeneratedSources field instead of a SourcesField, where the Dependencies are not inferred until after the sources have been hydrated. But I realize that just because I can describe it that way doesn't mean anything is set up to allow it to be implemented.
I'll check in with my coworkers and see what we can cook up that can be worked publicly.
h
Well, I’m not sure you need a `Target` or `Field` class of any kind, is my point.
Start without and see how far you get
I mean, you do need a `cc_binary()`-like target type to represent the binary, an actual one in a BUILD file. And you need a `cc_sources()`-like target type to represent the sources on disk.
But I’m not sure you have to represent intermediate generated sources that way
Also I should mention that we are generally switching to calling rules by name: https://github.com/pantsbuild/pants/issues/19730
It is not properly documented yet
But there are issues with calling by name recursively, so you might need to continue using `Get` until we iron those out.
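(For comparison, the two styles side by side; the call-by-name spelling is new, and its exact import and signature may vary by version:)

from pants.engine.fs import CreateDigest, Digest, FileContent
from pants.engine.rules import Get

# Old style, inside an @rule body:
digest = await Get(Digest, CreateDigest([FileContent("hello.txt", b"hi")]))

# Call-by-name style (per the issue above):
from pants.engine.intrinsics import create_digest

digest = await create_digest(CreateDigest([FileContent("hello.txt", b"hi")]))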
f
ah. I did see some of that when reading the code and I was wondering what was up with that calling syntax.
h
I would recommend using call-by-name wherever you can, and if you get rule graph construction failures that switching to `Get` resolves, then that is likely the recursive issue.
Which obviously we will fix