# pex
f
Hi everyone, I have a question around pex. Thanks for the help in advance! Is there a way to install a pex file instead of bootstrapping a pex environment? Here is the context: we have a lot of applications built around Apache Beam. Currently we can package our application along with its dependencies in a pex file and run the Beam pipeline locally. However, when we want to submit this to Google Dataflow, we cannot simply run the pex file inside Dataflow, as it's a managed service and the worker entry point is controlled by Google. It seems there is no easy way to do the bootstrap when they launch the worker. Dataflow only supports installing dependencies from `requirements.txt`, `tar.gz`, or `wheel`. Any recommendations for this use case? In an ideal case, we would prefer to build a single pex that can run locally and later be submitted to the cloud.
There are some levels of customization we can do for the Dataflow worker, but the key problem here is that I don't have an easy way to bootstrap this in the Dataflow worker Python processes.
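For context, a minimal sketch of what Dataflow expects instead of a pex: a requirements file and/or extra sdist/wheel packages that the workers pip-install at startup. The project, bucket, and package names below are hypothetical, and the option names are taken from Beam's standard pipeline options, so double-check them against your Beam version.
```python
# Minimal sketch (hypothetical project/bucket/package names) of declaring
# dependencies the way Dataflow supports them: a pinned requirements file
# plus extra internal packages, pip-installed on each worker.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                 # hypothetical
    region="us-central1",
    temp_location="gs://my-bucket/tmp",       # hypothetical
    requirements_file="requirements.txt",     # pinned third-party deps
    extra_packages=["dist/my_internal_lib-1.0.0.tar.gz"],  # internal sdists/wheels
)

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
```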
h
Let me think about this for a moment.
By "install a pex" you mean treat it like a "fat wheel" - you pip-install it into a virtualenv, and it splats out everything that's inside it?
f
Yes “fat wheel” is another good way to think about this 😄
h
You could build your code into a wheel, using `./pants setup-py`. Does that not work? Its metadata will list the right dependencies.
f
Will this generate a fat wheel (one that includes the dependencies inside the wheel), or does it only include the source and pointers to dependencies?
I think to some extent a fat wheel may be something I can start with; not sure if there is a way to do it from Pants directly. Or would a source wheel with frozen dependencies be better?
h
That would generate a regular wheel, but that wheel's metadata contains references to its dependencies (including other wheels built from your code). So it would be like any other wheel, which it sounds like Google Dataflow supports.
You'd have to publish your wheel I guess? Or, how does Google Dataflow work?
f
Google will upload wheels to GCS and download them in the worker to do a pip install.
I think in this case the key is to make sure I can generate the same environment as what pex has locally; that's our pain point. Our pipeline succeeds locally but fails in the cloud due to dependency issues.
h
I see
You said that they support a `tar.gz` file.
What would that file's format be?
f
It's a Python source distribution.
h
ah, so what about its dependencies?
f
If the tar.gz declares dependencies, it will be installed along with those dependencies. Basically I believe Dataflow is doing `pip install xx.tar.gz` behind the scenes. This doc has a good explanation around this: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
Another thing that may be helpful here is to somehow figure out the list of dependencies in the pex and turn them into a pinned requirements.txt.
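One rough way to do that, sketched below: a .pex file is a zip archive whose `PEX-INFO` entry is JSON metadata about the resolved distributions. This is not an official pex API, just reading the metadata directly; the `distributions` field name and the filename parsing are assumptions that may vary across pex versions.
```python
# Rough sketch (not an official pex API): read PEX-INFO out of a built pex
# and turn the resolved distribution filenames into pinned requirements.
import json
import zipfile


def pinned_requirements_from_pex(pex_path):
    """Return a sorted list of 'name==version' pins recovered from a pex."""
    with zipfile.ZipFile(pex_path) as zf:
        pex_info = json.loads(zf.read("PEX-INFO"))
    pins = []
    # Keys look like 'some_dist-1.2.3-py3-none-any.whl'; split out name/version.
    for dist_filename in pex_info.get("distributions", {}):
        name, version = dist_filename.split("-")[:2]
        pins.append("{}=={}".format(name, version))
    return sorted(pins)


if __name__ == "__main__":
    for pin in pinned_requirements_from_pex("app.pex"):  # hypothetical path
        print(pin)
```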
h
OK, so basically Dataflow takes dists or requirements.txt and installs them
sdist, wheel or just requirements
You can generate regular sdists and wheels from your Python code using Pants, but you might have to publish dists for your shared internal deps, which you may not want to do?
What I mean is, say you have:
```
job1/BUILD:

python_library(
  dependencies=['some/lib'],
  provides=setup_py(name='cruise.job1', version='1.2.3')
)

job2/BUILD:

python_library(
  dependencies=['some/lib'],
  provides=setup_py(name='cruise.job2', version='4.5.6')
)

some/lib/BUILD:

python_library(
  provides=setup_py(name='cruise.lib', version='7.8.9')
)
```
Then the wheel you build from `job1` will have `cruise.lib==7.8.9` in its requirements (Pants figures that out for you), so `cruise.lib` needs to be published somewhere Dataflow can see it, even if `cruise.job1` is not published but just uploaded directly. `cruise.lib` becomes a regular external requirement to it, see what I mean?
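To make that concrete, here is a rough sketch (not the literal Pants output) of what the generated setup for `cruise.job1` amounts to: its own sources plus a pinned requirement on the separately published `cruise.lib`.
```python
# Rough sketch (not the literal Pants-generated file) of what the metadata
# for the cruise.job1 dist boils down to.
from setuptools import find_packages, setup

setup(
    name='cruise.job1',
    version='1.2.3',
    packages=find_packages(),
    install_requires=[
        'cruise.lib==7.8.9',  # inferred by Pants from the BUILD dependency graph
        # ...plus any third-party requirements of job1's own sources
    ],
)
```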
And there might be many of these, depending on your codebase structure. Note that you can publish code from multiple `python_library` targets in a single dist (a lib will be published in the dist provided by the closest ancestor that depends on it and has a `provides=` clause). So you don't need one-dist-per-target, that would be nuts. But you do still need to publish those inner dists, which is probably not what you want here?
So let me read up on that Dataflow link and see what I can think of
f
Hi @happy-kitchen-89482, thanks! Let me try this to make sure I understand the exact behavior. I think I can publish the packages into our internal Artifactory and download them from the Dataflow workers. Also cc @calm-artist-46894
h
Great
The way to think about it is this:
You have N `python_library` targets in your repo.
Some subset of those, of size K, are "exported", that is, they have a `provides=` stanza.
Each exported target can generate a dist containing not just its own sources and deps, but the sources (and deps) of some of the other N targets.
For a `python_library` target T, how do we decide which exported target ET exports it?
We look at all the exported targets that:
1. Depend on T (directly or indirectly).
2. Are in an ancestor directory of T (or are its sibling).
And we take the closest ancestor of those.
This creates an unambiguous mapping from the N targets to the K exported targets. A target T can either:
1. Have a single ET that meets the criteria above.
2. Have more than one (because two sibling exported targets depend on it), in which case we error.
3. Have no ET that meets those criteria, in which case we error.
But note that we only error if you actually try to run setup-py on something that depends on that target.
It's fine to have whole parts of your codebase that aren't exported at all, as long as no exported code depends on them.
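As a concrete illustration, here is a hypothetical layout (paths and versions invented, reusing the names from the earlier example) showing how the closest-exported-ancestor rule assigns an unexported target to a dist:
```python
# Hypothetical layout (names invented) illustrating the ownership rule above.

# src/cruise/lib/BUILD -- exported; it depends on util and is its closest
# exported ancestor, so util's sources are published inside the cruise.lib dist:
python_library(
  dependencies=['src/cruise/lib/util'],
  provides=setup_py(name='cruise.lib', version='7.8.9')
)

# src/cruise/lib/util/BUILD -- not exported (no provides=); it gets exported
# by cruise.lib above:
python_library()

# src/cruise/jobs/job1/BUILD -- exported; it also uses util, but it is neither
# an ancestor nor a sibling of src/cruise/lib/util, so its dist simply
# requires cruise.lib==7.8.9 rather than re-exporting util's sources:
python_library(
  dependencies=['src/cruise/lib/util'],
  provides=setup_py(name='cruise.job1', version='1.2.3')
)
```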
Does all that make sense?
f
To be honest, I don't feel like I 100% get this. Let me do some experiments on this and get back to you. Thank you!
h
Happy to explain more clearly. I'll also write the docs for this soon.