bored-energy-25252 (12/07/2022, 3:32 AM):
manage databricks deployments using dbx. Here is the related work: https://github.com/da-tubi/pants-pyspark-pex

bored-energy-25252 (12/07/2022, 7:29 AM):
pip install takes a long time. Using PEX will help us save costs.
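For context, the linked pants-pyspark-pex repo ships each Spark job as a single PEX built by Pants, so dependencies are bundled once at build time instead of pip-installed on every cluster start. A minimal BUILD sketch of that idea; the target and file names here are illustrative, not taken from the repo:

```python
# BUILD: one self-contained PEX per Spark job, so the cluster never runs
# pip install at startup. Names are illustrative.
python_sources(name="lib")

pex_binary(
    name="spark_job",
    entry_point="main.py",   # resolved against this directory's sources
    dependencies=[":lib"],
    include_tools=True,      # bundle pex-tools; useful later for venv extraction
)
```

`./pants package path/to:spark_job` then emits a `dist/.../spark_job.pex` that already contains every requirement, which is what removes the per-cluster pip install step.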
[a long run of messages here was not captured in the export]

loud-laptop-89838 (01/12/2023, 3:37 PM):
…pex_binary targets, but for jobs and pipelines. Individually package only the job/pipeline requirements, and deploy them all in parallel using the Databricks REST API. Similar to what their dbx package does now, but with individual targets instead of a single large deployment.
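A rough sketch of that per-target, parallel deploy. The `/api/2.1/jobs/create` endpoint is Databricks' documented Jobs API; the host/token env vars, job specs, and paths are illustrative:

```python
# Sketch: create one Databricks job per packaged artifact, in parallel,
# instead of one monolithic dbx deployment. Payload contents illustrative.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-123.azuredatabricks.net
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def create_job(spec: dict) -> int:
    resp = requests.post(f"{HOST}/api/2.1/jobs/create", headers=HEADERS, json=spec)
    resp.raise_for_status()
    return resp.json()["job_id"]

job_specs = [
    {
        "name": "etl-job",  # one spec per job/pipeline target
        "tasks": [{
            "task_key": "main",
            "spark_python_task": {"python_file": "dbfs:/artifacts/etl_main.py"},
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }],
    },
    # ...more specs, one per target
]

with ThreadPoolExecutor() as pool:
    job_ids = list(pool.map(create_job, job_specs))
print(job_ids)
```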
loud-laptop-89838 (01/12/2023, 3:51 PM):
…use the databricks cli to upload artifacts and then configure the jobs with that location.

loud-laptop-89838 (01/12/2023, 4:53 PM):
./pants databricks deploy or whatever; it packages the required artifacts, deploys them to S3 (in my case Azure) using the databricks cli, and creates the jobs in the same way. You could use macros to define different environments or test deployments.
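The macro idea could look roughly like this. Pants lets you register prelude files via `[GLOBAL].build_file_prelude_globs` in pants.toml; `databricks_job` below is a hypothetical macro, not a built-in Pants target, and the tag scheme is made up for illustration:

```python
# macros.py, registered via [GLOBAL].build_file_prelude_globs.
# `databricks_job` is hypothetical: it stamps out one pex_binary per
# target environment so each can be deployed independently.

def databricks_job(name, entry_point, environments=("dev", "prod"), **kwargs):
    for env in environments:
        pex_binary(
            name=f"{name}-{env}",
            entry_point=entry_point,
            # A deploy step could read this tag to pick the workspace
            # and cluster config for the environment.
            tags=[f"databricks-env:{env}"],
            **kwargs,
        )
```

A BUILD file would then just say `databricks_job(name="etl", entry_point="main.py")` and get dev/prod variants for free.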
loud-laptop-89838 (01/12/2023, 6:48 PM):
…a run_pex job, and you pass in the pex file location as an argument.
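That generic run_pex job could be a launcher script registered once as a Databricks job, with the pex path arriving as a job parameter. A hypothetical sketch of the launcher (note swift-river reports further down that subprocessing out to a pex loses the Spark session config, so treat this as the idea, not a verified recipe):

```python
# run_pex.py: hypothetical generic launcher registered once as a job.
# The pex location and its arguments arrive as job parameters.
import subprocess
import sys

def main() -> None:
    pex_path, *pex_args = sys.argv[1:]
    # Invoke the pex through the current interpreter so we don't depend
    # on the pex's shebang finding a matching python on PATH.
    subprocess.run([sys.executable, pex_path, *pex_args], check=True)

if __name__ == "__main__":
    main()
```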
bored-energy-25252 (01/13/2023, 5:06 AM):
> You could definitely upload the pex file, no problem. The problem is running it as a job, because databricks only supports certain job types and pex isn't one of them.
Running a Databricks Job works fine, but it seems there is no way to make a Databricks Notebook work with PEX.

bored-energy-25252 (01/13/2023, 5:16 AM):
…pip install.
swift-river-73520 (04/06/2023, 8:25 PM):
My plan is to use a python_wheel_task to basically just execute a script that's part of a module in the pex file, since I think setting PYSPARK_PYTHON=<pex_file> should make everything in the pex file importable. I think I would just need an init script to copy the pex to the cluster, plus setting the PYSPARK_PYTHON env var in the cluster config and the spark.files: <path_to_pex> spark config.
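In Jobs API terms, that cluster setup might look like the dict below. The `init_scripts`, `spark_env_vars`, and `spark_conf` fields are documented parts of the cluster spec; the paths are illustrative, and whether PYSPARK_PYTHON can point straight at a pex file is exactly the open question here:

```python
# Sketch of the new_cluster block for that plan (paths illustrative).
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Copies the pex from DBFS onto each node before Spark starts.
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/copy_pex.sh"}}],
    # The idea under test: make the pex itself the driver/executor python.
    "spark_env_vars": {"PYSPARK_PYTHON": "/local_disk0/app.pex"},
    "spark_conf": {"spark.files": "dbfs:/artifacts/app.pex"},
}
```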
swift-river-73520 (04/10/2023, 5:04 PM):
…setting PYSPARK_PYTHON in the job config will cause this startup step to fail. Using a regular entrypoint script that subprocesses out to the pex file seems to fail because the spark session that gets instantiated in my pex subprocess is missing crucial config like the spark host URL, which I assume is set by Databricks during startup. I've done some work trying to create a docker image I can run a job with, which unpacks the pex file into a virtualenv, but I've had some issues getting Databricks to recognize/use it. I'm still hopeful I can get that approach to work, as it allows quite a bit of customization.

I've also tried a spark-submit job pointing to a python script as an entrypoint, distributing the pex file using --files <pex_file>, but this fails on startup with this message in the driver logs: /usr/bin/env: 'python3.9': No such file or directory. I've seen this error when trying to run my pex file locally in an environment that doesn't have a python3.9 alias, so I suspect there's a way to build my pex file so that it looks for python / python3 or some specific path on startup, but so far I haven't made any progress by modifying the search_path in pants.toml. I'm thinking maybe I need to change the interpreter_constraints, but I haven't gone down that rabbit hole yet.
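On the python3.9 error: a pex's shebang defaults to the interpreter it was built against (the pex CLI's --python-shebang flag can override it), but the shebang only matters when the file is exec'd directly. Invoking the pex through an interpreter known to exist on the node sidesteps the lookup entirely; the paths below are illustrative:

```python
# Run the pex via an explicit interpreter instead of exec'ing it, so the
# `/usr/bin/env: 'python3.9': No such file or directory` failure can't occur.
import subprocess

subprocess.run(
    ["/databricks/python3/bin/python", "/local_disk0/app.pex", "--job", "etl"],
    check=True,
)
```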
swift-river-73520 (04/10/2023, 11:15 PM):
I got a custom docker image kind of working, based on the databricksruntime/standard:11.3-LTS base image. In the dockerfile I create a virtualenv using the tools built into my pex file, then copy the site-packages from that venv into the /databricks/python3 venv that the databricksruntime/standard:11.3-LTS base image provides, and set PYSPARK_PYTHON=/databricks/python3/bin/python. Now that it's kind of working, I'll go back and see if I can't just create the /databricks/python3 venv directly from the pex file. I had tried that before, but there are a handful of customizations Databricks does to that environment to get it to work on a cluster; I just have to track down those steps by tracing the base docker images back from databricksruntime/standard:11.3-LTS and pulling in those setup steps.

I couldn't find any way to call the pex file directly in a databricks job. It might still be possible, but I suspect it would require a custom image of some kind. It might also be possible in a spark-submit job; with that approach I got to the point where I thought it was running the pex file, but it just couldn't find a python3.9 alias. So maybe a custom image + spark-submit could work; I'll play with it if I get time.
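The "tools built into my pex file" are pex-tools, available when the binary is built with include_tools=True (as in the earlier BUILD sketch). The venv subcommand below is the real pex-tools command; all paths are illustrative. A rough shape of that dockerfile step, written as the commands it runs:

```python
# Image-build step sketch: materialize a real virtualenv from the pex,
# whose site-packages can then be copied into /databricks/python3.
import os
import subprocess

env = {**os.environ, "PEX_TOOLS": "1"}
# `PEX_TOOLS=1 python3 app.pex venv <dir>` unpacks the pex's code and
# dependencies into a virtualenv at <dir>.
subprocess.run(["python3", "/tmp/app.pex", "venv", "/tmp/pexenv"], env=env, check=True)
```

From there the dockerfile copies /tmp/pexenv's site-packages into the /databricks/python3 venv, as described above, so PYSPARK_PYTHON can stay pointed at the stock Databricks interpreter.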