# general
h
open question for the community: has anyone worked with PEX files and Databricks? we’re currently trying to adopt Databricks to run our Spark jobs. we’ve already been packaging jobs into individual PEX files containing all the necessary first- and third-party code, and i really like how cleanly this works. however, it looks like Databricks doesn’t support PEX files and would require us to use wheels instead. does anyone here have experience with Databricks? were you able to use PEX files in any way, or were you forced to use wheels?
h
Funny, I'm already looking into this for another team with a similar issue. Apparently OSS Spark supports PEX, but Databricks does not
I was not able to find a hack that would make PEX files work
So you probably need to build wheels using `python_distribution` targets
The interesting question is whether you want to build "fat wheels" that contain all their internal dependencies, to make deployment easier. Normally this is not a good idea, because multiple published wheels might then contain copies of the same dependency code instead of depending on that code from a separate wheel. But since these wheels aren't published, it can be OK
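As a sketch, a Pants `python_distribution` target for one of these jobs might look roughly like this (target, package, and version names are made up for illustration):

```python
# BUILD file sketch -- all names here are hypothetical, not from the thread.
python_distribution(
    name="my_job_wheel",
    # Pull in the job's first-party library code (and its dependencies).
    dependencies=[":my_job_lib"],
    provides=python_artifact(
        name="my-job",
        version="0.1.0",
    ),
    # Produce a wheel only, no sdist.
    wheel=True,
    sdist=False,
)
```

You would then build it with something like `pants package path/to/project:my_job_wheel`.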
h
yeah, Spark itself works just fine with PEX files (we homebrewed our own PoC that did exactly that), but unfortunately Databricks has no native way to run PEX files. i'm considering a hack that would run a python script to download the PEX and then execute it as a subprocess, but we'll see if that works at all. if not, your fat-wheel suggestion sounds like the next best option 🙂
@happy-kitchen-89482 following up here, how do i get pants to include third-party dependencies when building a `python_distribution`?
h
You can get it to include all transitive first-party deps by writing your own setup.py (the generated setup.py takes great pains to avoid doing so, because it's generally a bad idea). But there is no way to include third-party requirements in a dist.
No way in Python, not just in Pants.
A dist is supposed to list the things it requires in its METADATA
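A hand-written setup.py of the kind described here might look roughly like the following; the project name, version, and `src/` layout are all illustrative, not from the thread.

```python
# Hypothetical hand-rolled setup.py that sweeps in all first-party packages.
from setuptools import find_packages, setup

setup(
    name="my-fat-job",
    version="0.1.0",
    package_dir={"": "src"},
    # Include every first-party package found under src/.
    packages=find_packages("src"),
    # Third-party code cannot be embedded in the wheel itself; a dist can
    # only *declare* its requirements (in METADATA, via install_requires)
    # for the installer to resolve.
    install_requires=[],
)
```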
When Databricks uses a wheel, does it not resolve that wheel's requirements?
h
I think it should, but we just want to shorten the install time, and I think building a single fat wheel and having Databricks download it from S3 should be considerably faster than downloading and installing 140+ packages 😅
h
Definitely. Annoyingly, that is exactly what PEX is for...
But there is no obvious way to build a superfat wheel like that, with or without Pants. You'd have to inline all the sources into a single root, and hope that their loading code doesn't assume anything about running in site-packages
This doesn't solve the immediate problem, but I wonder if we can lean on Databricks to support PEX
I know some people there
h
yeah, we’ve already asked if they’d consider adding native support
h
Does Databricks have a recommended way of doing any of this?
Oh, what did they say?
h
they said they’d put it on their backlog and would listen for more requests, so adding some noise will hopefully help to prioritize the feature 🙂
h
It's surprising that Spark supports it and they don't
I also notice this repo: https://github.com/databricks/pex