# general
a
Dumb q of the day: I have a pyspark job that runs on Glue. Happily, I can package my python code and deps in a pex, but I also need to deploy some jars. I've added the experimental Java backend, added my jars as `jvm_artifact` targets in my 3rdparty directory, and run `generate-lockfiles`. I have a lockfile containing jars. How do I get Pants to copy the resulting jars over to the dist directory? I don't have any Java code, only a `pex_binary`. I tried adding the jars as a dep of that, but that seemed to make no difference.
f
Maybe either the `deploy_jar` or `jvm_war` target types will be of use? The `package` goal applies to them.
a
I did set up a `deploy_jar`, too, and included that as a dep of my pex; likewise, no dice. 3rdparty/java/BUILD:
```python
jvm_artifact(
    name="hadoop-cloud",
    group="org.apache.spark",
    artifact="spark-hadoop-cloud_2.13",
    version="3.5.1",
)

deploy_jar(
    name="spark-jars",
    dependencies=[":hadoop-cloud"],
    main="",
)
```
src/ingest/BUILD:
```python
pex_binary(
    name="spark",
    tags=["artifact"],
    entry_point="spark_script.py",
    dependencies=[
        "!!3rdparty/python:ingest#awsglue-dev",
        "!!3rdparty/python:ingest#pyspark",
        "!!3rdparty/python:shared#click",
        "3rdparty/java:spark-jars",
    ],
)
```
Running `pants package` yields me a `spark.pex`, but doesn't seem to do anything jar-wise. I can explicitly `pants package 3rdparty/java`, though, so maybe that'll have to do for today.
f
You could use https://www.pantsbuild.org/2.21/reference/targets/experimental_wrap_as_resources to wrap the `deploy_jar` as a "resource" target. Then include that resource target as a dependency of the pex.
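A minimal sketch of that wiring, using the target addresses from the BUILD files above (the new target name is illustrative; check the `experimental_wrap_as_resources` reference for the exact fields):

```python
# 3rdparty/java/BUILD -- hypothetical addition; name is an assumption.
experimental_wrap_as_resources(
    name="spark-jars-resources",
    inputs=["3rdparty/java:spark-jars"],
)
```

The `pex_binary` would then depend on `3rdparty/java:spark-jars-resources` instead of the `deploy_jar` directly.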
a
I'll give it a go.
f
Then the jar should be on the Python path and accessible via `pkgutil.get_data` or `importlib.resources.*` (e.g. `read_binary`, but there is probably one that gives you the path).
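A hedged sketch of reading such a bundled file at runtime, assuming the jar ends up as package data; the package name `ingest` and filename `spark-jars.jar` are illustrative, not taken from the thread:

```python
from importlib.resources import files


def read_packaged_file(package: str, filename: str) -> bytes:
    """Return the raw bytes of a data file shipped inside `package`."""
    # files() returns a Traversable; joinpath/read_bytes work both for
    # on-disk packages and zip imports (e.g. from inside a pex).
    return files(package).joinpath(filename).read_bytes()


# Illustrative only -- assumes the resource exists at runtime:
# jar_bytes = read_packaged_file("ingest", "spark-jars.jar")
```

`importlib.resources.as_file(...)` is likely the "one that gives you the path": it is a context manager that materializes a resource to a real filesystem path when one is needed.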
a
In all honesty, that doesn't matter to me. I just need it written out to dist so that I can deploy it from our CI pipeline. If we only resolve and copy it when either the jar or the `pex_binary` changes, all the better, but it doesn't actually need to be available to Python.
f
Ah, then I misunderstood. Then why not just run `pants package` with multiple targets, i.e. the pex and the `deploy_jar`, when run in CI?
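Using the target addresses from the BUILD files above, that could look something like this in CI:

```shell
# Package both artifacts into dist/ in one invocation.
pants package src/ingest:spark 3rdparty/java:spark-jars
```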