#general

swift-river-73520

04/05/2023, 12:46 AM
has anyone tried finding a way to use pex files with Databricks? I foresee some relatively major complications arising as a result of needing to get things installed on the cluster workers, seems like if possible it would require some pretty hacky workarounds like unzipping the pex and installing the included dependencies somehow. probably just better off creating a wheel and an entrypoint but just curious if anyone has tried it and gotten anywhere
it would just make my life a lot easier if I could bring all my dependencies with me to my clusters and pex files are obviously a great way to transport environments...

broad-processor-92400

04/05/2023, 12:55 AM
there's a few threads about databricks that might contain some nuggets (e.g. https://pantsbuild.slack.com/archives/C01SPQQ2WK1/p1670377180569119 is a long one). I also note https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html has a "Using PEX" section. (I know nothing about pex + databricks specifically, though)

swift-river-73520

04/05/2023, 1:05 AM
awesome, thanks a bunch! will do some reading then

wonderful-boots-93625

04/06/2023, 3:12 PM
I have totally done this, using pex files as packages to distribute to cluster nodes, and hence avoiding all the python dependency mess that was in spark. Looks like the docs have some good guidance there
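(For reference, a minimal sketch of the PEX approach described in the "Using PEX" section of the Databricks blog post linked above: ship the pex file to the workers via `spark.files` and point `PYSPARK_PYTHON` at it. The helper names `pex_spark_settings` and `build_session` and the file name are hypothetical, and nothing in this thread confirms that it works unchanged on a Databricks-managed cluster. A pex file does not bundle a Python interpreter, so the nodes still need a compatible one.)

```python
import os


def pex_spark_settings(pex_filename):
    """Return (env, conf) settings that point PySpark workers at a shipped PEX file.

    Hypothetical helper: it only builds the settings, it does not start Spark.
    """
    # Workers see the shipped file in their working directory, hence "./<name>".
    env = {"PYSPARK_PYTHON": "./" + os.path.basename(pex_filename)}
    # "spark.files" distributes the file; on YARN the key is "spark.yarn.dist.files".
    conf = {"spark.files": pex_filename}
    return env, conf


def build_session(pex_filename):
    """Create a SparkSession whose Python workers run from the PEX file."""
    # Imported lazily so pex_spark_settings can be used without pyspark installed.
    from pyspark.sql import SparkSession

    env, conf = pex_spark_settings(pex_filename)
    os.environ.update(env)
    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (assumes pyspark is installed and the pex file exists):
# spark = build_session("pyspark_pex_env.pex")
```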

swift-river-73520

04/06/2023, 6:29 PM
yeah it's a nightmare trying to build reproducible pipelines in Databricks, I'm really excited about the possibility of using PEX to circumvent those issues. @wonderful-boots-93625 I may bug ya with a couple questions as I get going, if you don't mind

wonderful-boots-93625

04/06/2023, 8:14 PM
Yea no problem - although it's been a while, and I did it for EMR - but conceptually it's the same

swift-river-73520

04/06/2023, 8:27 PM
ah okay yeah the spark setup stuff seems like it would be the same. the part I'm a little perplexed about and just need to start experimenting with is a bit specific to databricks, namely how to configure the job such that it uses the pex file as an entrypoint - databricks doesn't directly expose a way to just run any binary when a job starts, it has to go through a python script or spark-submit or a couple other entrypoints. might need to just start trying stuff
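(One thing to try, as a sketch only: since the job has to start from a python script, that script can be a thin wrapper that launches the pex file as a subprocess. A pex file is a zip archive with a `__main__.py`, so a Python interpreter can execute it directly. The function names and the path below are hypothetical, and this has not been verified on Databricks.)

```python
import subprocess
import sys


def build_pex_command(pex_path, args=()):
    """Build the argv that runs a PEX file with the current interpreter."""
    # Running "python some.pex" works because a PEX is a zip with a __main__.py.
    return [sys.executable, pex_path, *args]


def run_pex(pex_path, args=()):
    """Run the PEX and return its exit code."""
    # check=True raises CalledProcessError on a non-zero exit,
    # so a failing PEX also fails the wrapping job.
    completed = subprocess.run(build_pex_command(pex_path, args), check=True)
    return completed.returncode


# Usage inside the job's entrypoint script (hypothetical path):
# run_pex("/path/to/my_job.pex", sys.argv[1:])
```

The wrapper forwards its own arguments to the pex file, so job parameters can stay in the job configuration rather than being hard-coded in the script.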