# welcome
l
Thanks @happy-kitchen-89482. I'm considering it for my small DataOps team. Our stack is all Azure currently. On top of the typical lint/format/test, my hope is to use it to manage databricks deployments using dbx, building docs using jupyter book, and deploying azure function apps. I know those integrations don't exist yet, so if you know of any plugins that have done it, or just tips for building out plugins in addition to what's in the docs, it would be greatly appreciated! One of the main things I want to figure out with the plugins is what fits best in Pants, and what fits best in a script/cicd. For example, if I have a long-running pipeline integration test the process might be as follows: Deploy a subset of test pipelines, launch those pipelines concurrently, monitor the status and wait for them all to complete, if all pipelines pass, deploy the production pipelines, restart production pipelines. I can do those things currently using dbx/databricks cli with a bit of scripting in Python that is managed in an Azure Devops pipeline. I would be curious to know whether and where Pants could fit into a process like that and how to approach creating a plugin if it would fit.
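For reference, the launch-and-wait piece of that scripting is roughly the kind of thing below (an untested sketch against the Databricks Jobs 2.1 REST API, not my actual script; the host/token env vars and job IDs are placeholders):
```
# Rough sketch of "launch the test pipelines concurrently, wait for them all to
# pass" using the Databricks Jobs 2.1 REST API. Host, token, and job IDs are
# placeholders; error handling is kept minimal on purpose.
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-xxxx.azuredatabricks.net
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

TEST_JOB_IDS = [101, 102, 103]  # hypothetical test-pipeline job IDs


def launch(job_id: int) -> int:
    """Trigger a run of a job and return the run_id (run-now returns immediately)."""
    resp = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": job_id})
    resp.raise_for_status()
    return resp.json()["run_id"]


def wait(run_id: int, poll_seconds: int = 30) -> str:
    """Poll a run until it terminates and return its result_state."""
    while True:
        resp = requests.get(f"{HOST}/api/2.1/jobs/runs/get", headers=HEADERS, params={"run_id": run_id})
        resp.raise_for_status()
        state = resp.json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state", "FAILED")
        time.sleep(poll_seconds)


run_ids = [launch(job_id) for job_id in TEST_JOB_IDS]  # kick them all off first
results = [wait(run_id) for run_id in run_ids]         # then wait for all of them

if all(result == "SUCCESS" for result in results):
    print("All test pipelines passed - safe to deploy the production pipelines.")
else:
    raise SystemExit(f"Test pipelines failed: {results}")
```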
👋 5
b
> manage databricks deployments using dbx,
Here is the related work: https://github.com/da-tubi/pants-pyspark-pex
We are using Databricks too. And using PEX will help us save time on pip install.
h
Does databricks now allow you to deploy a pex? Open-source spark does, but I was under the impression that databricks does not
b
I haven’t tested, but I guess it does. But I have to install pex via init scripts.
For PyTorch, Tensorflow, `pip install` takes a long time. Using PEX will help us save costs.
b
@bored-energy-25252 would your https://github.com/da-tubi/jupyter-notebook-best-practice also be relevant for Eric?
l
Yea @bored-energy-25252 shared that with me in #general !
That's good to know @bored-energy-25252. Do you know off hand whether dbx will recognize a pex file in the dist folder? Or are you saying that you install using an init script, so you'd upload the file and then install on the cluster, rather than installing from a private repo?
Because currently dbx will build and deploy my package automatically using poetry, but I think it's deployed as a wheel file.
h
Let us know in #general how this goes? We had other users tell us that databricks doesn't support directly deploying a pex (even though oss spark does), and I had been trying to talk to folks I know at databricks about adding support for this. But if it works already...
l
Will do. I have a contact at databricks that I can talk to about it too. I think the issue is that dbx currently supports setuptools and poetry. Would just need to add pants maybe? I could create an issue in dbx at some point but I don't fully understand yet how it would be implemented...
But it would work if I just marked the file to upload to dbfs, then attached an init script to my pipelines. It's just not as nice as a native build tool.
h
Yeah they would just need to support Pex as a format, instead of a dist. And it is a good idea for them, since pex is self-contained, and a distribution is very much not...
We are happy to work with them on this
l
So would it be at the dbx package level or actual databricks platform?
And then would it be supporting pants as a build tool at the dbx level?
b
May I interject? If you've got a contact over there that you can personally speak to, might it be possible to get you, them, and Benjy into a brief meeting together to figure out what the best approach would be? As Benjy mentioned, we're always happy to collaborate. And ideally it'd be a solution that both projects feel comfortable maintaining.
Having a driving use case like yours can be really helpful for coming up with a design that makes sense for everyone.
l
Sure. Let's see what I can do. Is the ask that Pants would like to better integrate with Databricks and to have a conversation along those lines to figure out the best approach?
b
I'll leave that to you and @happy-kitchen-89482 to articulate. But seems to me it's legit to express the need you raised here, and that our team would love to collaborate with both you and Databricks to come up with a solution to a need that others have told us they share with you.
h
I think it starts with "ingest .pex files" and then Pants can add a "publish pexes to databricks" plugin
🎉 1
l
I reached out to my contact. Let me know what is the best way to follow up
h
Feel free to DM me when they get back to you!
b
Good luck, folks! I hope something exciting comes of it <fingers crossed>...
@happy-kitchen-89482 is there any open ticket about this? Databricks might find it helpful to see previous requests, to get a better picture of the need and that it's been voiced by more than one organization.
f
In the process of doing a wide search for "databricks" I came across this thread. I'm looking to move our spark jobs (whl deployed on databricks) over to our pants repo. Did anyone ever figure out if databricks supports pex files? I also assumed I'd need to continue using my custom scripts for deployments (s3 upload + terraform). As far as I know pants does not support s3 or terraform as publish targets.
l
@faint-dress-64989 check out the discussion here: https://github.com/pantsbuild/pants/discussions/17802 Not much there yet, but building a databricks plugin is on my medium term roadmap. I'm hoping to get to it over the next month or two, finish in 4-6 months? It'll definitely help if you're interested in lending a hand.
The idea would be to do something like the existing `pex_binary` targets, but for jobs and pipelines. Individually package only the job/pipeline requirements and deploy them all in parallel using the Databricks REST API. Similar to what their `dbx` package does now, but with individual targets instead of a single large deployment.
not sure if that works for your workflow, but add your comments to the discussion if you think it'll help. Otherwise you could probably just create a custom plugin that essentially subclasses the wheel binary targets and adds an S3 upload option.
f
My workflow is to upload the whl file to s3 then deploy via terraform. I'm not sure where dbx expects your job artifact to be before deploying, but it probably uses the same APIs as the Databricks terraform module.
l
dbx takes care of putting it in the correct location and configuring your job to point at the artifact. It just uses the Databricks API to do it. So the end state is probably the same, your artifact ends up in the Databricks managed S3 with a job pointing at it.
In the databricks pants plugin I'd likely want to do the same thing, use the `databricks` cli to upload artifacts and then configure the jobs with that location.
f
I'm definitely down to collaborate. Pants can already handle packaging job artifacts. The missing piece is publishing artifacts somewhere databricks supports (i.e. S3), plus some kind of "deploy" target. Thus far I've only used pants for building/publishing docker containers. Once you publish a container to a registry there's already amazing tooling for getting that container deployed to k8s (argocd + argocd image updater).
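The publish piece I have in mind is basically just this (a rough boto3 sketch; the bucket and prefix are placeholders, and a real plugin would presumably hang this off a publish/deploy goal rather than a loose script):
```
# Minimal sketch of the missing "publish" step: push artifacts that
# `./pants package` already wrote to dist/ up to S3 with boto3.
# Bucket and prefix are placeholders.
import pathlib

import boto3

BUCKET = "my-databricks-artifacts"  # hypothetical bucket
PREFIX = "jobs"

s3 = boto3.client("s3")

for artifact in pathlib.Path("dist").glob("*.whl"):
    key = f"{PREFIX}/{artifact.name}"
    s3.upload_file(str(artifact), BUCKET, key)
    print(f"uploaded s3://{BUCKET}/{key}")
```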
l
yeah so have you worked with the databricks CLI at all? My thought is to build the targets to just match the cli/API data structures, so you'd define all the job parameters in a target. Then when you run `./pants databricks deploy` or whatever, it packages the required artifacts, deploys them to S3 (in my case Azure) using the databricks cli and creates the jobs in the same way. You could use macros to define different environments or test deployments.
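To make that concrete, here's the kind of BUILD file I'm picturing - nothing below exists yet, the `databricks_job` target type and all of its fields are made up for illustration:
```
# Hypothetical BUILD file sketch - the databricks_job target type does not
# exist; its fields just mirror the Jobs API payload, much like dbx config does.

pex_binary(
    name="etl_binary",
    entry_point="my_project.etl:main",
)

databricks_job(                    # invented target type, illustration only
    name="nightly_etl",
    artifact=":etl_binary",        # which packaged artifact to upload
    existing_cluster_id="0123-456789-abcdefgh",
    schedule="0 0 2 * * ?",        # quartz cron, same format the Jobs API uses
    parameters=["--env", "test"],
)
```
Then `./pants databricks deploy path/to:nightly_etl` (goal name equally hypothetical) would package the artifact, upload it, and create or update the job.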
If you want to start with adding any comments to the discussion, we can go from there. I still need to finish the plugin I'm currently working on (jupyter-book), but then I want to work on this.
h
Getting databricks to let you upload pex files is my white whale...
I would love to see it though, but it requires them to support it
It is definitely in their interest though, they don't currently have a great story for how to deploy non-trivial python code, beyond "publish a lot of wheels"
f
They have some notes on using pex files here. Still need to test them out
l
I've done it based on some instructions that a Databricks engineer shared with me. you can do it, but it's hacky.
You could definitely upload the pex file, no problem. The problem is running it as a job because databricks only supports certain job types and pex isn't one of them.
You'd basically need to set up a base `run_pex` job, and you pass in the pex file location as an argument.
... which is actually totally doable. It would be basically the same as running the pex file directly, you just have a python script for example that does it.
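Something like this is what I mean (untested sketch - the path handling is an assumption, and as later messages in this thread show, the subprocess can end up missing the Spark config Databricks injects at startup):
```
# Sketch of the "base run_pex job" idea: a plain python-script job whose first
# argument is the location of a pex file, which it executes in a subprocess,
# forwarding any remaining arguments. Assumes the pex is already somewhere the
# driver can read it (e.g. copied by an init script or via a /dbfs/ path).
import subprocess
import sys


def main() -> int:
    pex_path, *job_args = sys.argv[1:]
    # Run the pex through the cluster's own interpreter so the pex shebang
    # doesn't have to resolve on the cluster.
    return subprocess.call([sys.executable, pex_path, *job_args])


if __name__ == "__main__":
    raise SystemExit(main())
```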
I was definitely surprised at how poor the CICD is for such a large platform. Even with a small team, working across environments is very difficult. I'm super excited to build this to be honest.
b
Reducing the `pip install` time will absolutely decrease Databricks' revenue.
😂 1
> You could definitely upload the pex file, no problem. The problem is running it as a job because databricks only supports certain job types and pex isn't one of them.
Running a Databricks Job works fine, but it seems there is no way to make a Databricks Notebook work with PEX.
https://github.com/databricks/pex Databricks forked PEX. Maybe PEX support will be included in a later Databricks Runtime release, and they will claim that the new release will help us save money on `pip install`.
FYI @sparse-night-38237
h
That fork looks like it had no activity in it for 7 years, so I wouldn't rely on it to work. My guess is it isn't actively used.
s
@loud-laptop-89838 @bored-energy-25252 what have y'all tried in terms of using a pex file as an entrypoint in a databricks job? Eric it seems like you're hinting at having a python script as an entrypoint which then uses a subprocess to execute the pex file and pass any args? And Darcy, you said it works fine - how did you set your job up to use the pex file as an entrypoint?
I'm wondering if maybe I can use the `python_wheel_task` to basically just execute a script that's part of a module in the pex file, since I think setting `PYSPARK_PYTHON=<pex_file>` should make everything in the pex file importable. I think I would just need to have an init script to copy the pex to the cluster and set the `PYSPARK_PYTHON` env var in the cluster config + set the `spark.files: <path_to_pex>` spark config
sick, looks like maybe I can just execute the pex file like any other python entrypoint. maybe I was overthinking this
h
Let us know if that works! Would love to document it and maybe even add an example repo
s
will do! I've got it entering the pex file at this point, just working through pointing it to the python interpreter on the cluster. I'll post what works. also very curious about the possibility of including the interpreter with scie-pants, will look into that after I get this working
l
@swift-river-73520 that's what I'd thought about, a script that executes any pex passed as an argument but it sounds like you've got a better approach already if that works!
s
@loud-laptop-89838 it's actually been a lot less straightforward than I had hoped. I spent most of Thursday and Friday trying various things. When a job is set up as a python script job, it appears that Databricks wants to create a virtualenv on startup, and setting `PYSPARK_PYTHON` in the job config will cause this startup step to fail. Using a regular entrypoint script that subprocesses out to the pex file seems to fail because the spark session that gets instantiated in my pex subprocess is missing crucial config like the spark host URL, which I assume is set by Databricks during startup. I've done some work trying to create a docker image that I can run a job with which unpacks the pex file into a virtualenv, but have had some issues getting Databricks to recognize / use it - I'm still hopeful I can get that approach to work as it allows quite a bit of customization. I've also tried a spark-submit job pointing to a python script as an entrypoint, distributing the pex file using `--files <pex_file>`, but this fails on startup with this message in the driver logs - `/usr/bin/env: 'python3.9': No such file or directory` - I've seen this error when trying to run my pex file locally in an environment that doesn't have a `python3.9` alias, so I suspect maybe there's a way to build my pex file so that it looks for `python` / `python3` or some specific path on startup, but so far I haven't made any progress there when modifying the `search_path` in `pants.toml`. Thinking maybe I need to change the interpreter_constraints but haven't gone down that rabbithole yet
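For reference, these are the pex_binary knobs I'm planning to poke at next - interpreter_constraints is definitely a real field; pex_binary also has a shebang field as far as I know, but double-check the docs for your Pants version (including whether the value needs the leading `#!`):
```
# Sketch of pinning the pex to the interpreter that actually exists on the
# cluster. interpreter_constraints is a standard pex_binary field; treat the
# shebang field and its value below as something to verify against the Pants docs.

python_sources(name="lib")

pex_binary(
    name="job",
    entry_point="my_project.job:main",
    # DBR 11.3 LTS ships Python 3.9, so constrain to that.
    interpreter_constraints=["CPython==3.9.*"],
    # Point the shebang at an interpreter path that exists on the cluster
    # (the /databricks/python3 venv) instead of a bare `python3.9` lookup.
    shebang="/databricks/python3/bin/python",
)
```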
alright I think I finally hacked something together but it's pretty ugly and might not be super stable. basically what I ended up doing was to create a custom docker image for my clusters using the `databricksruntime/standard:11.3-LTS` base image. In the dockerfile I create a virtualenv using the tools built into my pex file, and then copy the `site-packages` from the venv into the `/databricks/python3` venv that the `databricksruntime/standard:11.3-LTS` base image provides, then set `PYSPARK_PYTHON=/databricks/python3/bin/python`. I think now that it's kind of working I'll go back and see if I can't just create the `/databricks/python3` venv directly from the pex file - I had tried that before but there were a handful of customizations that databricks does to that environment to get it to work on a cluster. just gotta track down those steps by tracing the base docker images back from `databricksruntime/standard:11.3-LTS` and pulling in those setup steps
I couldn't find any way to call the pex file directly in a databricks job. it might still be possible but I suspect it would require a custom image of some kind. it might also be possible in a `spark-submit` job - with this approach I got to the point where I thought it was running the pex file but it just couldn't find a `python3.9` alias - so maybe a custom image + `spark-submit` could work, will play with it if I get time.
q
Hi there, joining the conversation a bit late. I have also been struggling to deploy my pex jobs to databricks and I was not able to do it. On the other hand, I successfully ran the job by packaging the job and its internal dependencies in separate wheels. I created a python_wheel_job and uploaded the dependencies, added the entry point, and it worked. Just to highlight, I manually uploaded the dependencies; now the struggle will be how to automate the full process, but I guess publishing the different wheels to an internal repository and pointing databricks to look in that index will do. Any advice is welcome!
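For illustration, the job-creation half of that automation can be driven straight against the Jobs 2.1 API with a python_wheel_task - an untested sketch, with all paths, names, and cluster settings as placeholders:
```
# Untested sketch: create a Databricks job whose task is a python_wheel_task
# pointing at a wheel built by `./pants package` and uploaded somewhere the
# cluster can install from. All names, paths, and cluster settings are placeholders.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "my-wheel-job",
    "tasks": [
        {
            "task_key": "main",
            "python_wheel_task": {
                "package_name": "my_project",  # distribution name of the wheel
                "entry_point": "main",         # console_scripts entry point in the wheel
            },
            "libraries": [
                {"whl": "dbfs:/FileStore/wheels/my_project-0.1.0-py3-none-any.whl"},
            ],
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",  # Azure node type, matching the stack above
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create", headers=HEADERS, json=job_spec)
resp.raise_for_status()
print("created job", resp.json()["job_id"])
```
The Jobs API also accepts `pypi` library specs with a custom `repo` index, which lines up with the internal-repository idea for the dependency wheels.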