# general
h
When I build a distribution, how do I include data files?
Normally this is controlled by
MANIFEST.in
, but I'm not seeing any mention of that in the Pants docs.
f
h
I don't think that's the same thing
I'm trying to build a
python_distribution
wheel that includes non-python files. When I build a pex of my system, it's exactly what I expect and includes my data files.
f
The code in
src/python/pants/backend/python/goals/setup_py.py
makes use of
ResourceSourceField
, so my inclination is that it could be relevant.
What happens if you add the
resources
target as a direct dependency of the
python_distribution
target?
h
It's already a transitive dependency, but I can try it.
h
Are you using Pants-generated setup.py, or writing your own?
If the former, then it should include resources
h
A pants-generated one
I can see that my distribution does depend on these files
./pants paths --from=//:ops_distribution --to=astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt 
[
  [
    "//:ops_distribution",
    "ops/scripts/geosat_1/odet/spacex_opm_to_odet_prior.py:../../all_scripts",
    "astranis-python/astranis/gnc/common/coordinate_transforms.py",
    "astranis-python/astranis/gnc/refsys/refsys_earth_orientation.py",
    "astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt:iau_finals"
  ],
  [
    "//:ops_distribution",
    "ops/scripts/geosat_1/payload_iot/antenna_characterization/antenna_characterization_slew.py:../../../all_scripts",
    "astranis-python/astranis/gnc/payload/payload_coordinate_transforms.py",
    "astranis-python/astranis/gnc/refsys/refsys_earth_orientation.py",
    "astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt:iau_finals"
  ],
  [
    "//:ops_distribution",
    "ops/scripts/geosat_1/payload_iot/antenna_characterization/antenna_characterization_slew.py:../../../all_scripts",
    "astranis-python/astranis/gnc/payload/payload_coordinate_transforms.py",
    "astranis-python/astranis/gnc/ephem/ephem.py",
    "astranis-python/astranis/gnc/refsys/refsys_earth_orientation.py",
    "astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt:iau_finals"
  ]
]
But if I
./pants package
my distribution, they aren't in the wheel
Here's what my distribution looks like
python_distribution(
    name="ops_distribution",
    sdist=False,
    provides=python_artifact(name="ops_dist", version="0"),
    dependencies=[
        "//ops/scripts:all_scripts",
        "//astranis-python/astranis/ground/target_interface/script_runner.py",
        "//astranis-python/astranis/shell/backdoor.py",
    ],
    entry_points={
        "console_scripts": {
            "run_ops_script":
            "astranis.ground.target_interface.script_runner:main",
            "backdoor":
            "astranis.shell.backdoor:main"
        }
    })
all_scripts
is a
python_sources
target with a glob pattern like
**/*.py
in case that's helpful.
f
what is the target type for
astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt:iau_finals
?
(just to confirm for clarity)
h
resources
specifically
resources(
    name="iau_finals",
    sources=["*iau2000A_finals_ab.txt"],
)
I use this in my system by making a separate pex that I can copy into my container that looks like this
pex_binary(
    name="executable_scripts",
    script="conscript",
    dependencies=[":ops_distribution", ":reqs#conscript"])
If I package that and unzip it, I see the
iau
files I want in that unzipped directory
Explicitly adding
generate_setup=True
did not change the behavior
Browsing the file contents of the wheel, I do see non-python files in other places. Various prototxt files and such. I'm not sure why this is different.
f
I wonder if this comment at https://github.com/pantsbuild/pants/blob/f6b2832539e26ddffb19a7d665b9cef697abd0da/src/python/pants/backend/python/goals/setup_py.py#L1047 is relevant:
# If resource is not in a package, ignore it. There's no principled way to load it anyway.
is the resources target you had mentioned under a Python package?
h
I don't think so. It's just a transitive dependency of one of our modules, but it lives in a different folder.
I do not understand that comment at all, though. We have plenty of non-py files that live under directories that only contain non-py files, and they can get pulled into other repos just fine because of our MANIFEST.in
As a silly example, here's kind of what this looks like
foo/
  - bar.py
tree/
  - file.txt
bar.py
depends on
file.txt
in my scenario.
h
Hmmm, this is vexing. Would you be able to create a small dummy github repo that reproduces this, with anything proprietary redacted? That’s usually the easiest way to debug stuff like this
Probably something nuanced is going on
h
A little crunched on time to fix this so odds are low
h
Is
astranis-python/astranis/gnc/external/iers/iau2000A_finals_ab.txt
under a source root? (as seen by running
./pants roots
)
And, how is that file being used in your code?
i.e., by what relative path is it being accessed?
h
Yes, it is. We're able to use it happily in unit tests and such and that's why it shows up in
./pants paths
. We use it by doing things like
pathlib.Path(__file__)
in one place and then navigating relative to that module to grab it.
I've confirmed that it's not just a path mixup error as the
astranis/gnc/external
folder never shows up in my wheel when I unzip and inspect it.
I also did a
find
with the unzipped wheel to make sure it wasn't anywhere in the wheel contents.
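The same check can be done without unzipping, since a wheel is a zip archive. A minimal sketch using only the standard library; the helper name `find_in_wheel` is hypothetical and not part of Pants:

```python
import zipfile


def find_in_wheel(wheel_path, needle):
    """Return the archive member names that contain `needle`."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist() if needle in name]


# Example call (the wheel path is an assumption about where the build wrote it):
# find_in_wheel("dist/ops_dist-0-py3-none-any.whl", "iau2000A_finals_ab.txt")
```

An empty list means the file is not in the wheel under any path, which rules out a simple path mixup.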
h
Very odd, if it’s under a source root, and wrapped in a
resources()
target then it should be there, as those other files are. Sounds like a subtle bug.
If you’re able to provide a repro, I’m happy to dive in and debug
h
I appreciate it. I'll be sure to tag you in if I'm able to pull something together.
f
do you have the ability to run Pants from a sibling directory to your repository via a
pants_from_sources
script?
if so, this diff may provide some additional logging that could be of use
h
As an update, I'm not able to reproduce with a simpler repository. Still poking...
I have learned more though. We currently have something like
lib/
  external/
    data/
       -> BUILD.pants
       -> my_data.txt
  data_user/
    -> BUILD.pants
    -> my_module.py
In this scenario, despite using
resources
in
lib/external/data/BUILD.pants
and properly linking that to
lib/data_user/my_module.py
in its BUILD file, we do not see the data file in the built distribution.
However, if I change this to
lib/
  data_user/
    -> BUILD.pants
    -> my_module.py
    -> my_data.txt
and declare the resource in
BUILD.pants
and link it to
my_module.py
, I do see it in my output distribution.
Is it possible to inspect the generated
setup.py
and
MANIFEST.in
? I think that would go a long way in debugging this.
nvm, figured that out
Okay, more info
If I change to this
lib/
  external/
    data/
       -> BUILD.pants
       -> my_data.txt
       -> dummy.py
  data_user/
    -> BUILD.pants
    -> my_module.py
and then say that
my_module.py
depends on
dummy.py
in
lib/data_user/BUILD.pants
, then that gets included in my built distribution.
So there's something going on here related to what Pants considers a package.
Worth noting: we don't use
__init__.py
in our repo. We rely on namespace packaging.
Confirmed that if I add logging at that section, all my data files that are missing are being ignored. So I think one outcome (and maybe can be my first pull request if folks agree) is to be a bit louder about resources that couldn't be included in the distribution rather than silently returning.
Is this an indication that a Pants distribution would not always work intuitively with a PEP 420-compliant repo?
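To make the observed behavior concrete, here is a rough sketch of the kind of rule that would produce it. This is not Pants's actual implementation. It assumes that a directory counts as a package only if it directly contains a Python file, that paths are given relative to the source root, and that a resource with no enclosing package is skipped. The function name `assign_resources` is hypothetical, and the warning is the "louder" behavior proposed above:

```python
import logging
from collections import defaultdict
from pathlib import PurePosixPath

logger = logging.getLogger(__name__)


def assign_resources(python_files, resource_files):
    """Map each resource to its closest enclosing package; warn about the rest.

    Returns (package_data, ignored), where package_data maps a dotted package
    name to resource paths relative to that package's directory.
    """
    # Assumption: any directory that directly holds a .py file is a package.
    package_dirs = {PurePosixPath(f).parent for f in python_files}
    package_dirs.discard(PurePosixPath("."))  # the source root itself is not a package

    package_data = defaultdict(list)
    ignored = []
    for resource in resource_files:
        path = PurePosixPath(resource)
        # Walk up from the resource's directory to find the nearest package.
        owner = next((d for d in path.parents if d in package_dirs), None)
        if owner is None:
            logger.warning("Resource %s is not in a package; leaving it out.", resource)
            ignored.append(resource)
            continue
        package_data[".".join(owner.parts)].append(path.relative_to(owner).as_posix())
    return dict(package_data), ignored
```

Under these assumptions, `lib/external/data/my_data.txt` is dropped when only `lib/data_user/my_module.py` exists, and is kept under `lib.external.data` once `dummy.py` sits next to it, which matches the three layouts described above.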
h
Thanks for the debugging effort. Yeah, that sounds like a footgun… A PR to add a warning there makes sense.
But fixing the underlying issue makes even more sense, I guess
In your scenario, how are you loading my_data.txt? I.e., by which path, relative to which package?
In other words, what would you expect to see in the generated setup.py’s package_data?
h
We add the
resources
target associated with the file we want to the dependencies of a
python_sources
target in another directory. We expect the built package to place things in a similar relative path so we do something like
pathlib.Path(__file__).parent / '<path to thing>'
to grab.
Personally, I'm more familiar with using
MANIFEST.in
to control data files when building distributions. I'm not sure how I would expect
package_data
to look in this scenario.
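For reference, the loading pattern described above looks roughly like this. It is a minimal sketch, and the helper names `data_file_path` and `read_data` are made up for illustration:

```python
from pathlib import Path


def data_file_path(module_file, *relative_parts):
    """Resolve a data file relative to the directory of the module that uses it."""
    return Path(module_file).resolve().parent.joinpath(*relative_parts)


def read_data(module_file, *relative_parts):
    """Read a data file, failing clearly if it was left out of the installation."""
    path = data_file_path(module_file, *relative_parts)
    if not path.is_file():
        # In an installed wheel, this is what a data file dropped at build time looks like.
        raise FileNotFoundError(f"data file missing from installation: {path}")
    return path.read_text()
```

From `lib/data_user/my_module.py` the call would be something like `read_data(__file__, "..", "external", "data", "my_data.txt")`. This only works if the built package keeps the same relative layout as the repo.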
h
I see.
MANIFEST.in
doesn’t distinguish between source and data files, IIUC
h
Correct. Also, I think I know how to reproduce this. Need to tweak the example repo some more.
~/devel/pants-distribution-bug (distribution-bug-repro)$ ./pants dependencies --transitive //:test_distribution
//:reqs#ansicolors
//:reqs#setuptools
//:reqs#types-setuptools
//requirements.txt:reqs
helloworld/external/data/top_secret.txt:secret_data
helloworld/greet/__init__.py:lib
helloworld/greet/greeting.py:lib
helloworld/greet:translations
helloworld/translator/translator.py:lib
~/devel/pants-distribution-bug (distribution-bug-repro)$ ./pants package //:test_distribution
10:48:48.50 [INFO] Wrote dist/test_dist-0-py3-none-any.whl
10:48:48.50 [INFO] Wrote dist/test_dist-0.tar.gz
~/devel/pants-distribution-bug (distribution-bug-repro)$ unzip -l dist/test_dist-0-py3-none-any.whl 
Archive:  dist/test_dist-0-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2022-10-26 16:47   helloworld/greet/__init__.py
     1317  2022-10-26 16:47   helloworld/greet/greeting.py
      323  2022-10-26 16:47   helloworld/greet/translations.json
      249  2022-10-26 16:47   test_dist-0.dist-info/METADATA
       92  2022-10-26 16:47   test_dist-0.dist-info/WHEEL
       62  2022-10-26 16:47   test_dist-0.dist-info/entry_points.txt
        1  2022-10-26 16:47   test_dist-0.dist-info/namespace_packages.txt
       11  2022-10-26 16:47   test_dist-0.dist-info/top_level.txt
      737  2022-10-26 16:47   test_dist-0.dist-info/RECORD
---------                     -------
     2792                     9 files
Okay, reproduced. I'll push to my repo and open a bug ticket so we can discuss further there.