# general
h
open question for the community: has anyone worked with PEX files and Databricks? we’re currently trying to adopt Databricks to run our Spark jobs. we’ve already been packaging jobs into individual PEX files containing all the necessary first- and third-party code, and i really like how cleanly this works. however, it looks like Databricks doesn’t support PEX files and would require us to use wheels instead. does anyone here have experience with Databricks? were you able to use PEX files in any way, or were you forced to use wheels?
h
Funny, I'm already looking into this for another team with a similar issue. Apparently OSS Spark supports PEX, but Databricks does not
I was not able to find a hack that would make PEX files work
So you probably need to build wheels using `python_distribution` targets
The interesting question is whether you want to build "fat wheels" that contain all their internal dependencies, to make deployment easier. Normally this is not a good idea, because multiple published wheels might then contain copies of the same dependency code instead of depending on that code from a separate wheel. But since these wheels aren't published, it can be OK
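As a sketch, a Pants `python_distribution` target for one of these jobs might look roughly like this (target, package, and version names are made up for illustration):

```python
# BUILD file sketch -- all names here are hypothetical, not from the thread.
python_distribution(
    name="my_job_wheel",
    # Pull in the job's first-party library code (and its dependencies).
    dependencies=[":my_job_lib"],
    provides=python_artifact(
        name="my-job",
        version="0.1.0",
    ),
    # Produce a wheel only, no sdist.
    wheel=True,
    sdist=False,
)
```

You would then build it with something like `pants package path/to/project:my_job_wheel`.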
h
yeah, Spark itself works just fine with PEX files (we homebrewed our own PoC that did exactly that), but unfortunately Databricks has no native way to run PEX files. i'm considering a hack that would run a python script to download the PEX and then execute it as a subprocess, but we'll see if that works at all. if not, your fat-wheel suggestion sounds like the next best option 🙂
@happy-kitchen-89482 following up here, how do i get pants to include third-party dependencies when building a `python_distribution`?
h
You can get it to include all transitive first-party deps by writing your own setup.py (the generated setup.py takes great pains to avoid doing so, because it's generally a bad idea). But there is no way to include third-party requirements in a dist.
No way in Python, not just in Pants.
A dist is supposed to list the things it requires in its METADATA
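A hand-written setup.py of the kind described here might look roughly like the following; the project name, version, and `src/` layout are all illustrative, not from the thread.

```python
# Hypothetical hand-rolled setup.py that sweeps in all first-party packages.
from setuptools import find_packages, setup

setup(
    name="my-fat-job",
    version="0.1.0",
    package_dir={"": "src"},
    # Include every first-party package found under src/.
    packages=find_packages("src"),
    # Third-party code cannot be embedded in the wheel itself; a dist can
    # only *declare* its requirements (in METADATA, via install_requires)
    # for the installer to resolve.
    install_requires=[],
)
```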
When Databricks uses a wheel, does it not resolve that wheel's requirements?
h
I think it should, but we just want to shorten the install time, and I think building a single fat wheel and having Databricks download it from S3 should be considerably faster than downloading and installing 140+ packages 😅
h
Definitely. Annoyingly, that is exactly what PEX is for...
But there is no obvious way to build a superfat wheel like that, with or without Pants. You'd have to inline all the sources into a single root, and hope that their loading code doesn't assume anything about running in site-packages
This doesn't solve the immediate problem, but I wonder if we can lean on Databricks to support PEX
I know some people there
h
yeah, we’ve already asked if they’d consider adding native support
h
Does Databricks have a recommended way of doing any of this?
Oh, what did they say?
h
they said they’d put it on their backlog and would listen for more requests, so adding some noise will hopefully help to prioritize the feature 🙂
h
It's surprising that Spark supports it and they don't
I also notice this repo: https://github.com/databricks/pex