# welcome
l
Thanks @happy-kitchen-89482. I'm considering it for my small DataOps team. Our stack is all Azure currently. On top of the typical lint/format/test, my hope is to use it to manage databricks deployments using dbx, building docs using jupyter book, and deploying azure function apps. I know those integrations don't exist yet, so if you know of any plugins that have done it, or just tips for building out plugins in addition to what's in the docs, it would be greatly appreciated! One of the main things I want to figure out with the plugins is what fits best in Pants, and what fits best in a script/cicd. For example, if I have a long-running pipeline integration test the process might be as follows: Deploy a subset of test pipelines, launch those pipelines concurrently, monitor the status and wait for them all to complete, if all pipelines pass, deploy the production pipelines, restart production pipelines. I can do those things currently using dbx/databricks cli with a bit of scripting in Python that is managed in an Azure Devops pipeline. I would be curious to know whether and where Pants could fit into a process like that and how to approach creating a plugin if it would fit.
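For reference, the launch-and-wait piece of that scripting is roughly the kind of thing below (an untested sketch against the Databricks Jobs 2.1 REST API, not my actual script; the host/token env vars and job IDs are placeholders):
```
# Rough sketch of "launch the test pipelines concurrently, wait for them all to
# pass" using the Databricks Jobs 2.1 REST API. Host, token, and job IDs are
# placeholders; error handling is kept minimal on purpose.
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-xxxx.azuredatabricks.net
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

TEST_JOB_IDS = [101, 102, 103]  # hypothetical test-pipeline job IDs


def launch(job_id: int) -> int:
    """Trigger a run of a job and return the run_id (run-now returns immediately)."""
    resp = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": job_id})
    resp.raise_for_status()
    return resp.json()["run_id"]


def wait(run_id: int, poll_seconds: int = 30) -> str:
    """Poll a run until it terminates and return its result_state."""
    while True:
        resp = requests.get(f"{HOST}/api/2.1/jobs/runs/get", headers=HEADERS, params={"run_id": run_id})
        resp.raise_for_status()
        state = resp.json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state", "FAILED")
        time.sleep(poll_seconds)


run_ids = [launch(job_id) for job_id in TEST_JOB_IDS]  # kick them all off first
results = [wait(run_id) for run_id in run_ids]         # then wait for all of them

if all(result == "SUCCESS" for result in results):
    print("All test pipelines passed - safe to deploy the production pipelines.")
else:
    raise SystemExit(f"Test pipelines failed: {results}")
```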
👋 5
b
> manage databricks deployments using dbx,
Here is the related work: https://github.com/da-tubi/pants-pyspark-pex
We are using Databricks too. And using PEX will help us save time on pip install.
h
Does databricks now allow you to deploy a pex? Open-source spark does, but I was under the impression that databricks does not
b
I haven’t tested, but I guess it does. But I have to install pex via init scripts.
For PyTorch, Tensorflow, `pip install` takes a long time. Using PEX will help us save costs.
b
@bored-energy-25252 would your https://github.com/da-tubi/jupyter-notebook-best-practice also be relevant for Eric?
l
Yea @bored-energy-25252 shared that with me in #general !
That's good to know @bored-energy-25252. Do you know off hand whether dbx will recognize a pex file in the dist folder? Or are you saying that you install using an init script, so you'd upload the file and then install on the cluster, rather than installing from a private repo?
Because currently dbx will build and deploy my package automatically using poetry, but I think it's deployed as a wheel file.
h
Let us know in #general how this goes? We had other users tell us that databricks doesn't support directly deploying a pex (even though oss spark does), and I had been trying to talk to folks I know at databricks about adding support for this. But if it works already...
l
Will do. I have a contact at databricks that I can talk to about it too. I think the issue is that dbx currently supports setuptools and poetry. Would just need to add pants maybe? I could create an issue in dbx at some point but I don't fully understand yet how it would be implemented...
But it would work if I just marked the file to upload to dbfs, then attached an init script to my pipelines. It's just not as nice as a native build tool.
h
Yeah they would just need to support Pex as a format, instead of a dist. And it is a good idea for them, since pex is self-contained, and a distribution is very much not...
We are happy to work with them on this
l
So would it be at the dbx package level or actual databricks platform?
And then would it be supporting pants as a build tool at the dbx level?
b
May I interject? If you've got a contact over there that you can personally speak to, might it be possible to get you, them, and Benjy into a brief meeting together to figure out what the best approach would be? As Benjy mentioned, we're always happy to collaborate. And ideally it'd be a solution that both projects feel comfortable maintaining.
Having a driving use case like yours can be really helpful for coming up with a design that makes sense for everyone.
l
Sure. Let's see what I can do. Is the ask that Pants would like to better integrate with Databricks and to have a conversation along those lines to figure out the best approach?
b
I'll leave that to you and @happy-kitchen-89482 to articulate. But seems to me it's legit to express the need you raised here, and that our team would love to collaborate with both you and Databricks to come up with a solution to a need that others have told us they share with you.
h
I think it starts with "ingest .pex files" and then Pants can add a "publish pexes to databricks" plugin
🎉 1
l
I reached out to my contact. Let me know what is the best way to follow up
h
Feel free to DM me when they get back to you!
b
Good luck, folks! I hope something exciting comes of it <fingers crossed>...
@happy-kitchen-89482 is there any open ticket about this? Databricks might find it helpful to see previous requests, to get a better picture of the need and that it's been voiced by more than one organization.
f
In the process of doing a wide search for "databricks" I came across this thread. I'm looking to move our spark jobs (whl deployed on databricks) over to our pants repo. Did anyone ever figure out if databricks supports pex files? I also assumed I'd need to continue using my custom scripts for deployments (s3 upload + terraform). As far as I know pants does not support s3 or terraform as publish targets.
l
@faint-dress-64989 check out the discussion here: https://github.com/pantsbuild/pants/discussions/17802 Not much there yet, but building a databricks plugin is on my medium term roadmap. I'm hoping to get to it over the next month or two, finish in 4-6 months? It'll definitely help if you're interested in lending a hand.
The idea would be to do something like the existing `pex_binary` targets, but for jobs and pipelines. Individually package only the job/pipeline requirements and deploy them all in parallel using the Databricks REST API. Similar to what their `dbx` package does now, but with individual targets instead of a single large deployment.
not sure if that works for your workflow, but add your comments to the discussion if you think it'll help. Otherwise you could probably just create a custom plugin that essentially subclasses the wheel binary targets and adds an S3 upload option.
f
My workflow is to upload the whl file to s3 then deploy via terraform. I'm not sure where dbx expects your job artifact to be before deploying, but it probably uses the same APIs as the Databricks terraform module.
l
dbx takes care of putting it in the correct location and configuring your job to point at the artifact. It just uses the Databricks API to do it. So the end state is probably the same, your artifact ends up in the Databricks managed S3 with a job pointing at it.
In the databricks pants plugin I'd likely want to do the same thing, use the `databricks` cli to upload artifacts and then configure the jobs with that location.
f
I'm definitely down to collaborate. Pants can already handle packaging job artifacts. The missing piece is publishing artifacts somewhere databricks supports (i.e. S3), plus some kind of "deploy" target. Thus far I've only used pants for building/publishing docker containers. Once you publish a container to a registry there's already amazing tooling for getting that container deployed to k8s (argocd + argocd image updater).
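The publish piece I have in mind is basically just this (a rough boto3 sketch; the bucket and prefix are placeholders, and a real plugin would presumably hang this off a publish/deploy goal rather than a loose script):
```
# Minimal sketch of the missing "publish" step: push artifacts that
# `./pants package` already wrote to dist/ up to S3 with boto3.
# Bucket and prefix are placeholders.
import pathlib

import boto3

BUCKET = "my-databricks-artifacts"  # hypothetical bucket
PREFIX = "jobs"

s3 = boto3.client("s3")

for artifact in pathlib.Path("dist").glob("*.whl"):
    key = f"{PREFIX}/{artifact.name}"
    s3.upload_file(str(artifact), BUCKET, key)
    print(f"uploaded s3://{BUCKET}/{key}")
```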
l
yeah so have you worked with the databricks CLI at all? My thought is to build the targets to just match the cli/API data structures, so you'd define all the job parameters in a target. Then when you run `./pants databricks deploy` or whatever, it packages the required artifacts, deploys them to S3 (in my case Azure) using the databricks cli and creates the jobs in the same way. You could use macros to define different environments or test deployments.
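To make that concrete, here's the kind of BUILD file I'm picturing - nothing below exists yet, the `databricks_job` target type and all of its fields are made up for illustration:
```
# Hypothetical BUILD file sketch - the databricks_job target type does not
# exist; its fields just mirror the Jobs API payload, much like dbx config does.

pex_binary(
    name="etl_binary",
    entry_point="my_project.etl:main",
)

databricks_job(                    # invented target type, illustration only
    name="nightly_etl",
    artifact=":etl_binary",        # which packaged artifact to upload
    existing_cluster_id="0123-456789-abcdefgh",
    schedule="0 0 2 * * ?",        # quartz cron, same format the Jobs API uses
    parameters=["--env", "test"],
)
```
Then `./pants databricks deploy path/to:nightly_etl` (goal name equally hypothetical) would package the artifact, upload it, and create or update the job.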
If you want to start with adding any comments to the discussion, we can go from there. I still need to finish the plugin I'm currently working on (jupyter-book), but then I want to work on this.
h
Getting databricks to let you upload pex files is my white whale...
I would love to see it though, but it requires them to support it
It is definitely in their interest though, they don't currently have a great story for how to deploy non-trivial python code, beyond "publish a lot of wheels"
f
They have some notes on using pex files here. Still need to test them out
l
I've done it based on some instructions that a Databricks engineer shared with me. you can do it, but it's hacky.
You could definitely upload the pex file, no problem. The problem is running it as a job because databricks only supports certain job types and pex isn't one of them.
You'd basically need to set up a base `run_pex` job, and you pass in the pex file location as an argument.
... which is actually totally doable. It would be basically the same as running the pex file directly, you just have a python script for example that does it.
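Something like this is what I mean (untested sketch - the path handling is an assumption, and as later messages in this thread show, the subprocess can end up missing the Spark config Databricks injects at startup):
```
# Sketch of the "base run_pex job" idea: a plain python-script job whose first
# argument is the location of a pex file, which it executes in a subprocess,
# forwarding any remaining arguments. Assumes the pex is already somewhere the
# driver can read it (e.g. copied by an init script or via a /dbfs/ path).
import subprocess
import sys


def main() -> int:
    pex_path, *job_args = sys.argv[1:]
    # Run the pex through the cluster's own interpreter so the pex shebang
    # doesn't have to resolve on the cluster.
    return subprocess.call([sys.executable, pex_path, *job_args])


if __name__ == "__main__":
    raise SystemExit(main())
```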
I was definitely surprised at how poor the CICD is for such a large platform. Even with a small team, working across environments is very difficult. I'm super excited to build this to be honest.
b
Reducing the `pip install` time will absolutely decrease Databricks' revenue.
😂 1
> You could definitely upload the pex file, no problem. The problem is running it as a job because databricks only supports certain job types and pex isn't one of them.
Running a Databricks Job works fine, but it seems there is no way to make a Databricks Notebook work with PEX.
https://github.com/databricks/pex Databricks forked PEX. Maybe PEX support will be included in a later Databricks Runtime release, and they will claim that the new release will help us save money on `pip install`.
FYI @sparse-night-38237
h
That fork looks like it had no activity in it for 7 years, so I wouldn't rely on it to work. My guess is it isn't actively used.
s
@loud-laptop-89838 @bored-energy-25252 what have y'all tried in terms of using a pex file as an entrypoint in a databricks job? Eric it seems like you're hinting at having a python script as an entrypoint which then uses a subprocess to execute the pex file and pass any args? And Darcy, you said it works fine - how did you set your job up to use the pex file as an entrypoint?
I'm wondering if maybe I can use the `python_wheel_task` to basically just execute a script that's part of a module in the pex file, since I think setting `PYSPARK_PYTHON=<pex_file>` should make everything in the pex file importable. I think I would just need to have an init script to copy the pex to the cluster and set the `PYSPARK_PYTHON` env var in the cluster config + set the `spark.files: <path_to_pex>` spark config
sick, looks like maybe I can just execute the pex file like any other python entrypoint. maybe I was overthinking this
h
Let us know if that works! Would love to document it and maybe even add an example repo
s
will do! I've got it entering the pex file at this point, just working through pointing it to the python interpreter on the cluster. I'll post what works. also very curious about the possibility of including the interpreter with scie-pants, will look into that after I get this working
l
@swift-river-73520 that's what I'd thought about, a script that executes any pex passed as an argument but it sounds like you've got a better approach already if that works!
s
@loud-laptop-89838 it's actually been a lot less straightforward than I had hoped. I spent most of Thursday and Friday trying various things. When a job is set up as a python script job, it appears that Databricks wants to create a virtualenv on startup, and setting `PYSPARK_PYTHON` in the job config will cause this startup step to fail. Using a regular entrypoint script that subprocesses out to the pex file seems to fail because the spark session that gets instantiated in my pex subprocess is missing crucial config like the spark host URL, which I assume is set by Databricks during startup. I've done some work trying to create a docker image that I can run a job with which unpacks the pex file into a virtualenv, but have had some issues getting Databricks to recognize / use it - I'm still hopeful I can get that approach to work as it allows quite a bit of customization. I've also tried a spark-submit job pointing to a python script as an entrypoint, distributing the pex file using `--files <pex_file>`, but this fails on startup with this message in the driver logs - `/usr/bin/env: 'python3.9': No such file or directory` - I've seen this error when trying to run my pex file locally in an environment that doesn't have a `python3.9` alias, so I suspect maybe there's a way to build my pex file so that it looks for `python` / `python3` or some specific path on startup, but so far I haven't made any progress there when modifying the `search_path` in `pants.toml`. Thinking maybe I need to change the interpreter_constraints but haven't gone down that rabbithole yet
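For reference, these are the pex_binary knobs I'm planning to poke at next - interpreter_constraints is definitely a real field; pex_binary also has a shebang field as far as I know, but double-check the docs for your Pants version (including whether the value needs the leading `#!`):
```
# Sketch of pinning the pex to the interpreter that actually exists on the
# cluster. interpreter_constraints is a standard pex_binary field; treat the
# shebang field and its value below as something to verify against the Pants docs.

python_sources(name="lib")

pex_binary(
    name="job",
    entry_point="my_project.job:main",
    # DBR 11.3 LTS ships Python 3.9, so constrain to that.
    interpreter_constraints=["CPython==3.9.*"],
    # Point the shebang at an interpreter path that exists on the cluster
    # (the /databricks/python3 venv) instead of a bare `python3.9` lookup.
    shebang="/databricks/python3/bin/python",
)
```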
alright I think I finally hacked something together but it's pretty ugly and might not be super stable. basically what I ended up doing was to create a custom docker image for my clusters using the `databricksruntime/standard:11.3-LTS` base image. In the dockerfile I create a virtualenv using the tools built into my pex file, and then copy the `site-packages` from the venv into the `/databricks/python3` venv that the `databricksruntime/standard:11.3-LTS` base image provides, then set `PYSPARK_PYTHON=/databricks/python3/bin/python`. I think now that it's kind of working I'll go back and see if I can't just create the `/databricks/python3` venv directly from the pex file - I had tried that before but there were a handful of customizations that databricks does to that environment to get it to work on a cluster. just gotta track down those steps by tracing the base docker images back from `databricksruntime/standard:11.3-LTS` and pulling in those setup steps
I couldn't find any way to call the pex file directly in a databricks job. it might still be possible but I suspect it would require a custom image of some kind. it might also be possible in a `spark-submit` job - with this approach I got to the point where I thought it was running the pex file but it just couldn't find a `python3.9` alias - so maybe a custom image + `spark-submit` could work, will play with it if I get time.
q
Hi there, joining the conversation a bit late. I have also been struggling to deploy my pex jobs to databricks and I was not able to do it. On the other hand, I successfully ran the job by packaging the job and its internal dependencies in separate wheels. I created a python_wheel_job and uploaded the dependencies, added the entry point, and it worked. Just to highlight, I manually uploaded the dependencies; now the struggle will be how to automate the full process, but I guess publishing the different wheels to an internal repository and pointing databricks to look in that index will do. Any advice is welcome!
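For illustration, the job-creation half of that automation can be driven straight against the Jobs 2.1 API with a python_wheel_task - an untested sketch, with all paths, names, and cluster settings as placeholders:
```
# Untested sketch: create a Databricks job whose task is a python_wheel_task
# pointing at a wheel built by `./pants package` and uploaded somewhere the
# cluster can install from. All names, paths, and cluster settings are placeholders.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "my-wheel-job",
    "tasks": [
        {
            "task_key": "main",
            "python_wheel_task": {
                "package_name": "my_project",  # distribution name of the wheel
                "entry_point": "main",         # console_scripts entry point in the wheel
            },
            "libraries": [
                {"whl": "dbfs:/FileStore/wheels/my_project-0.1.0-py3-none-any.whl"},
            ],
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",  # Azure node type, matching the stack above
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create", headers=HEADERS, json=job_spec)
resp.raise_for_status()
print("created job", resp.json()["job_id"])
```
The Jobs API also accepts `pypi` library specs with a custom `repo` index, which lines up with the internal-repository idea for the dependency wheels.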