# general
s
I have a naive question. I'm trying to run some code on a ray cluster. In order to get this to work I need to tell ray a little bit about the packages I use: essentially, I need to pass a dict like `{"pip": [pip requirements], "env_vars": {...}, "working_dir": "."}` to `ray.init(runtime_env=...)`.
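Concretely, something of this shape (the pins and paths here are just placeholders, not my real dependencies):

```python
import ray

# Rough sketch of the call I want to end up with; the requirement pins,
# env vars, and working_dir below are placeholders.
ray.init(
    runtime_env={
        "pip": ["db-dtypes~=1.0.4", "gcsfs~=2022.10.0"],
        "env_vars": {"PYTHONPATH": "archipelago/src:capstan/src"},
        "working_dir": ".",
    }
)
```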
I have an "entrypoint", `item-rank/src/item_rank/publish.py`, so I can do something like `./pants dependencies --transitive item-rank/src/item_rank/publish.py`. If I keep all of my requirements in a single `requirements.txt` file, I can parse the output and get something like
```
"db-dtypes~=1.0.4",
"gcsfs~=2022.10.0",
"google-cloud-bigquery-storage==2.16.0",
```
from
```
//:reqs#db-dtypes
//:reqs#gcsfs
//:reqs#google-cloud-bigquery
```
with a little bit of bash golf. How might I translate
```
archipelago/src/archipelago/foo.py
capstan/src/capstan/bar.py
```
into env vars like `PYTHONPATH=$PYTHONPATH:archipelago/src:capstan/src`, i.e. the relevant source roots? (I suppose this is just `./pants roots`.) I've done this manually (rough sketch below) and it seems to work. I'm running into permission issues, so there are still errors that will take until Monday to resolve, but it looks like a viable route.
Edit: I never really asked a question; this comment is really more that I'm curious how other folks have approached integrating with ray. I feel like I'm traveling down the wrong path here, yet I should be able to query pants for everything I need to construct an environment for my ray workers. A better solution would be for ray to understand pex, but that doesn't appear to be a supported feature yet.
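The manual version of that step is just glue around `./pants roots`; roughly (a sketch, not verbatim what I ran):

```python
import subprocess

# `./pants roots` prints one source root per line (e.g. archipelago/src,
# capstan/src); join them into a PYTHONPATH-style string for runtime_env.
roots = subprocess.run(
    ["./pants", "roots"], capture_output=True, text=True, check=True
).stdout.split()
pythonpath = ":".join(roots)  # -> "archipelago/src:capstan/src:..."
```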
h
As you said, `pants roots` is what you want for the PYTHONPATH, and you can get the reqs with something like
```
pants dependencies --transitive path/to/file.py | \
  xargs pants list --filter-target-type=python_requirement | \
  xargs pants peek | \
  jq .[].requirements
```
The `peek` goal gives you detailed info about each input target.
s
Thanks
This is awesome. I added a little bit to reduce everything into a single array:
```
./pants dependencies --transitive item-rank/src/item_rank/publish.py | \
  xargs ./pants list --filter-target-type=python_requirement | \
  xargs ./pants peek | jq '[.[] | .requirements[]] | reduce .[] as $item ([]; . + [$item])'
```
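Consuming that from Python is then just a subprocess call plus `json.loads`. A rough sketch, with the entrypoint hard-coded:

```python
import json
import subprocess

# Run the pipeline above and parse the JSON array it prints; the result
# goes straight into runtime_env["pip"].
PIPELINE = (
    "./pants dependencies --transitive item-rank/src/item_rank/publish.py | "
    "xargs ./pants list --filter-target-type=python_requirement | "
    "xargs ./pants peek | "
    "jq '[.[] | .requirements[]] | reduce .[] as $item ([]; . + [$item])'"
)
result = subprocess.run(
    PIPELINE, shell=True, capture_output=True, text=True, check=True
)
requirements = json.loads(result.stdout)  # e.g. ["db-dtypes~=1.0.4", ...]
```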
I wonder if it would make sense to wrap this up as a command and have an `experimental_shell_command` dependency.
h
Neat
You could if this is something you'll need regularly
it would be nice if you could deploy a single Pex to ray instead of having to tell ray how to build your code...
s
There's https://github.com/ray-project/ray/issues/15518 but I don't think I have the skill / time to contribute to that project
h
The fact that ray, and databricks, and aws lambda, and gcp cloud functions all want you to provide raw requirement and entry point metadata to deploy python tells you how immature the python deployment story still is
Hah, yes
Exactly
s
To databricks' credit, pyspark supports pex. Unfortunately my team has a huge aversion to spark / pyspark
h
Pyspark does but databricks doesn't, for some reason
g
Yusuf, thanks for sharing; we also use ray and databricks. We don't have a solution for ray yet, but we do publish pex files to use with databricks. We use some custom init scripts to inject some code into the python site-packages to init the pex environment.
s
I did a little digging on this. Ray allows you to configure runtime environments via `RuntimeEnvPlugin`. They use this internally for implementing `pip` (https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/pip.py#L387), `conda` (https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/conda.py#L257), etc. I haven't dug into the code except at a superficial level, but it seems like a `PexPlugin` utilizing the same strategy might be possible.
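Purely speculative, but the shape might be something like this. Ray's plugin interface is internal and unstable, so the import path, method names, and signatures below are assumptions modeled loosely on pip.py, not something I've run:

```python
# SPECULATIVE sketch: the plugin API is ray-internal and version-dependent;
# the import path, method names, and signatures are assumptions based on
# skimming pip.py, not a working implementation.
from ray._private.runtime_env.plugin import RuntimeEnvPlugin


class PexPlugin(RuntimeEnvPlugin):
    name = "pex"

    async def create(self, uri, runtime_env, context, logger=None):
        # Hypothetically: fetch/stage the .pex referenced in the runtime_env
        # on the worker node and report how many bytes it uses.
        ...

    def modify_context(self, uris, runtime_env, context, logger=None):
        # Hypothetically: point the worker at the staged pex, e.g. by
        # adjusting the python executable or the worker command prefix.
        ...
```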