# general
a
Dumb q of the day: I have a pyspark job that runs on Glue. Happily, I can package my python code and deps in a pex, but I also need to deploy some jars. I've added the experimental Java backend, added my jars as `jvm_artifact` targets in my 3rdparty directory, and run `generate-lockfiles`. I have a lockfile containing jars. How do I get Pants to copy the resulting jars over to the dist directory? I don't have any Java code, only a `pex_binary`. I tried adding the jars as a dep of that, but that seemed to make no difference.
f
Maybe either the `deploy_jar` or `jvm_war` target types will be of use? The `package` goal applies to them.
a
I did set up a `deploy_jar`, too, and included that as a dep of my pex; likewise, no dice. 3rdparty/java/BUILD:
```python
jvm_artifact(
    name="hadoop-cloud",
    group="org.apache.spark",
    artifact="spark-hadoop-cloud_2.13",
    version="3.5.1",
)

deploy_jar(
    name="spark-jars",
    dependencies=[":hadoop-cloud"],
    main="",
)
```
src/ingest/BUILD:
```python
pex_binary(
    name="spark",
    tags=["artifact"],
    entry_point="spark_script.py",
    dependencies=[
        "!!3rdparty/python:ingest#awsglue-dev",
        "!!3rdparty/python:ingest#pyspark",
        "!!3rdparty/python:shared#click",
        "3rdparty/java:spark-jars",
    ],
)
```
Running `pants package` yields me a `spark.pex`, but doesn't seem to do anything jar-wise. I can explicitly `pants package 3rdparty/java`, though, so maybe that'll have to do for today.
f
You could use https://www.pantsbuild.org/2.21/reference/targets/experimental_wrap_as_resources to wrap the `deploy_jar` as a "resource" target. Then include that resource target as a dependency of the pex.
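A minimal sketch of that wiring, using the target addresses from the BUILD files above (the new target name is illustrative; check the `experimental_wrap_as_resources` reference for the exact fields):

```python
# 3rdparty/java/BUILD -- hypothetical addition; name is an assumption.
experimental_wrap_as_resources(
    name="spark-jars-resources",
    inputs=["3rdparty/java:spark-jars"],
)
```

The `pex_binary` would then depend on `3rdparty/java:spark-jars-resources` instead of the `deploy_jar` directly.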
a
I'll give it a go.
f
Then the jar should be on the Python path and accessible via `pkgutil.get_data` or `importlib.resources.*` (e.g. `read_binary`, but there is probably one that gives you the path).
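A hedged sketch of reading such a bundled file at runtime, assuming the jar ends up as package data; the package name `ingest` and filename `spark-jars.jar` are illustrative, not taken from the thread:

```python
from importlib.resources import files


def read_packaged_file(package: str, filename: str) -> bytes:
    """Return the raw bytes of a data file shipped inside `package`."""
    # files() returns a Traversable; joinpath/read_bytes work both for
    # on-disk packages and zip imports (e.g. from inside a pex).
    return files(package).joinpath(filename).read_bytes()


# Illustrative only -- assumes the resource exists at runtime:
# jar_bytes = read_packaged_file("ingest", "spark-jars.jar")
```

`importlib.resources.as_file(...)` is likely the "one that gives you the path": it is a context manager that materializes a resource to a real filesystem path when one is needed.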
a
In all honesty, that doesn't matter to me. I just need it written out to dist so that I can deploy it from our CI pipeline. If we only resolve and copy it when either the jar or the `pex_binary` changes, all the better, but it doesn't actually need to be available to Python.
f
Ah, then I misunderstood. Then why not just run `pants package` with multiple targets, i.e. the pex and the `deploy_jar`, when run in CI?
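Using the target addresses from the BUILD files above, that could look something like this in CI:

```shell
# Package both artifacts into dist/ in one invocation.
pants package src/ingest:spark 3rdparty/java:spark-jars
```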