# pex
f
Hi everyone, I have a question around pex. Thanks for the help in advance! Is there a way to install a pex file instead of bootstrapping a pex environment? Here is the context: we have a lot of applications built around Apache Beam. Currently we can package our application along with its dependencies in a pex file and run the Beam pipeline locally. However, when we want to submit this to Google Dataflow, we cannot simply run the pex file inside Dataflow, as it's a managed service and the worker entry point is controlled by Google. It seems there is no easy way to do the bootstrap when they launch the worker. Dataflow only supports installing dependencies from `requirements.txt`, `tar.gz`, or `wheel`. Any recommendations for this use case? In an ideal case, we would prefer to build a single pex that can run locally and later be submitted to the cloud.
There are some levels of customization we can do for the Dataflow worker, but the key problem here is that I don't have an easy way to bootstrap this in the Dataflow worker Python processes.
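For context, a minimal sketch of what Dataflow expects instead of a pex: a requirements file and/or extra sdist/wheel packages that the workers pip-install at startup. The project, bucket, and package names below are hypothetical, and the option names are taken from Beam's standard pipeline options, so double-check them against your Beam version.
```python
# Minimal sketch (hypothetical project/bucket/package names) of declaring
# dependencies the way Dataflow supports them: a pinned requirements file
# plus extra internal packages, pip-installed on each worker.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                 # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",       # hypothetical
    requirements_file="requirements.txt",     # pinned third-party deps
    extra_packages=["dist/my_internal_lib-1.0.0.tar.gz"],  # internal sdists/wheels
)

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
```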
h
Let me think about this for a moment.
By "install a pex" you mean treat it like a "fat wheel" - you pip-install it into a virtualenv, and it splats out everything that's inside it?
f
Yes “fat wheel” is another good way to think about this 😄
h
You could build your code into a wheel, using `./pants setup-py`. Does that not work? Its metadata will list the right dependencies.
f
Will this generate a fat wheel (one that includes the dependencies inside the wheel), or does it only include the source and pointers to dependencies?
I think to some extent a fat wheel may be something I can start with; not sure if there is a way to do it from Pants directly. Or would a source wheel with frozen dependencies be better?
h
That would generate a regular wheel, but that wheel's metadata contains references to its dependencies (including other wheels built from your code). So it would be like any other wheel, which it sounds like Google Dataflow supports.
You'd have to publish your wheel I guess? Or, how does Google Dataflow work?
f
Google will upload wheels to GCS and download them in the worker to do a pip install.
I think in this case the key is to make sure I can generate the same environment as what pex has locally; that's our pain point. Our pipeline succeeds locally but fails in the cloud due to dependency issues.
h
I see
You said that they support a `tar.gz` file.
What would that file's format be?
f
It's a Python source distribution.
h
ah, so what about its dependencies?
f
If the tar.gz declares dependencies, it will be installed along with those dependencies. Basically I believe Dataflow is doing `pip install xx.tar.gz` behind the scenes. This doc has a good explanation around this: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
Another thing that may be helpful here is to somehow figure out the list of dependencies in the pex and turn them into a pinned requirements.txt.
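One rough way to do that, sketched below: a .pex file is a zip archive whose `PEX-INFO` entry is JSON metadata about the resolved distributions. This is not an official pex API, just reading the metadata directly; the `distributions` field name and the filename parsing are assumptions that may vary across pex versions.
```python
# Rough sketch (not an official pex API): read PEX-INFO out of a built pex
# and turn the resolved distribution filenames into pinned requirements.
import json
import zipfile


def pinned_requirements_from_pex(pex_path):
    """Return a sorted list of 'name==version' pins recovered from a pex."""
    with zipfile.ZipFile(pex_path) as zf:
        pex_info = json.loads(zf.read("PEX-INFO"))
    pins = []
    # Keys look like 'some_dist-1.2.3-py3-none-any.whl'; split out name/version.
    for dist_filename in pex_info.get("distributions", {}):
        name, version = dist_filename.split("-")[:2]
        pins.append("{}=={}".format(name, version))
    return sorted(pins)


if __name__ == "__main__":
    for pin in pinned_requirements_from_pex("app.pex"):  # hypothetical path
        print(pin)
```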
h
OK, so basically Dataflow takes dists or requirements.txt and installs them
sdist, wheel or just requirements
You can generate regular sdists and wheels from your Python code using Pants, but you might have to publish dists for your shared internal deps, which you may not want to do?
What I mean is, say you have:
```
job1/BUILD:

python_library(
  dependencies=['some/lib'],
  provides=setup_py(name='cruise.job1', version='1.2.3')
)

job2/BUILD:

python_library(
  dependencies=['some/lib'],
  provides=setup_py(name='cruise.job2', version='4.5.6')
)

some/lib/BUILD:

python_library(
  provides=setup_py(name='cruise.lib', version='7.8.9')
)
```
Then the wheel you build from `job1` will have `cruise.lib==7.8.9` in its requirements (Pants figures that out for you), so `cruise.lib` needs to be published somewhere Dataflow can see it, even if `cruise.job1` is not published but just uploaded directly. `cruise.lib` becomes a regular external requirement to it, see what I mean?
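To make that concrete, here is a rough sketch (not the literal Pants output) of what the generated setup for `cruise.job1` amounts to: its own sources plus a pinned requirement on the separately published `cruise.lib`.
```python
# Rough sketch (not the literal Pants-generated file) of what the metadata
# for the cruise.job1 dist boils down to.
from setuptools import find_packages, setup

setup(
    name='cruise.job1',
    version='1.2.3',
    packages=find_packages(),
    install_requires=[
        'cruise.lib==7.8.9',  # inferred by Pants from the BUILD dependency graph
        # ...plus any third-party requirements of job1's own sources
    ],
)
```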
And there might be many of these, depending on your codebase structure. Note that you can publish code from multiple `python_library` targets in a single dist (a lib will be published in the dist provided by the closest ancestor that depends on it and has a `provides=` clause). So you don't need one-dist-per-target, that would be nuts. But you do still need to publish those inner dists, which is probably not what you want here?
So let me read up on that Dataflow link and see what I can think of
f
Hi @happy-kitchen-89482, thanks! Let me try this to make sure I understand the exact behavior. I think I can publish the packages into our internal Artifactory and download them from the Dataflow workers. Also cc @calm-artist-46894
h
Great
The way to think about it is this:
You have N `python_library` targets in your repo.
Some subset of those, of size K, are "exported", that is, they have a `provides=` stanza.
Each exported target can generate a dist containing not just its own sources and deps, but the sources (and deps) of some of the other N targets.
For a `python_library` target T, how do we decide which exported target ET exports it?
We look at all the exported targets that:
1. Depend on T (directly or indirectly).
2. Are in an ancestor directory of T (or are its sibling).
And we take the closest ancestor of those.
This creates an unambiguous mapping from the N targets to the K exported targets. A target T can either:
1. Have a single ET that meets the criteria above.
2. Have more than one (because two sibling exported targets depend on it), in which case we error.
3. Have no ET that meets those criteria, in which case we error.
But note that we only error if you actually try to run setup-py on something that depends on that target.
It's fine to have whole parts of your codebase that aren't exported at all, as long as no exported code depends on them.
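As a concrete illustration, here is a hypothetical layout (paths and versions invented, reusing the names from the earlier example) showing how the closest-exported-ancestor rule assigns an unexported target to a dist:
```python
# Hypothetical layout (names invented) illustrating the ownership rule above.

# src/cruise/lib/BUILD -- exported; it depends on util and is its closest
# exported ancestor, so util's sources are published inside the cruise.lib dist:
python_library(
  dependencies=['src/cruise/lib/util'],
  provides=setup_py(name='cruise.lib', version='7.8.9')
)

# src/cruise/lib/util/BUILD -- not exported (no provides=); it gets exported
# by cruise.lib above:
python_library()

# src/cruise/jobs/job1/BUILD -- exported; it also uses util, but it is neither
# an ancestor nor a sibling of src/cruise/lib/util, so its dist simply
# requires cruise.lib==7.8.9 rather than re-exporting util's sources:
python_library(
  dependencies=['src/cruise/lib/util'],
  provides=setup_py(name='cruise.job1', version='1.2.3')
)
```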
Does all that make sense?
f
To be honest, I don't feel like I 100% get this. Let me do some experiments on this and get back to you. Thank you!
h
Happy to explain more clearly. I'll also write the docs for this soon.