# general
s
Hi everyone, I’m used to using GCP’s Dataflow for data workflows, but a project I’m working on is using AWS instead. I think ultimately I’d like to use EMR and PySpark for this, but there are a lot of moving pieces with EMR, Spark, Docker, Pants/Pex, etc. What’s the best way to do this at a high level? I’m guessing there’s something with `./pants package` that lets me bundle everything together, and then use EMR’s Docker support to run my code remotely? I’m not sure how to create an environment for the workers, although I assume there’s a way to install the Pants repo into the Docker image so that it’s available to them? If I weren’t using Pants, I think I would set up a Docker container with my repo `pip install`’d into it, so that all of the Python imports work.
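For reference, `./pants package` builds from targets declared in BUILD files. A minimal sketch of what that might look like for a PySpark job follows; the directory, target names, and entry point are illustrative, not taken from this thread.

```python
# BUILD -- illustrative sketch only; names and layout are made up.
# `./pants package src/jobs:my_job` would bundle these sources plus their
# resolved third-party requirements into dist/src.jobs/my_job.pex.

python_sources(name="lib")           # first-party Python sources in this directory

pex_binary(
    name="my_job",
    entry_point="my_job.py",         # hypothetical PySpark entry-point module
    dependencies=[":lib"],           # usually inferred by Pants; shown for clarity
)
```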
h
Hi, I know close to nothing about EMR or Spark, but `./pants package` can build a Docker image that contains your Python code and its requirements, all ready to run. See this blog post for a lot of detail on how to do this efficiently (you may not care about the performance issues at first, in which case you can experiment with just the simpler examples in that post). Is that what you were looking for?
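To make that concrete, the Docker route with Pants is roughly a `docker_image` target next to a Dockerfile that copies in the packaged PEX. This is a rough sketch, assuming the `pants.backend.docker` backend is enabled and reusing the illustrative names from the BUILD sketch above.

```python
# BUILD -- illustrative sketch only, assuming pants.backend.docker is enabled.
# An adjacent Dockerfile would COPY the PEX produced by the pex_binary target,
# e.g.:
#   FROM python:3.9-slim
#   COPY src.jobs/my_job.pex /app/my_job.pex
#   ENTRYPOINT ["/app/my_job.pex"]
# `./pants package src/jobs:my_job_image` then builds the image with the PEX inside.

docker_image(name="my_job_image")
```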
c
Spark supports PEX directly, so I would avoid all the Docker jazz and send it PEX files. It’s very similar to the JAR experience.
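The PEX route with PySpark looks roughly like the PEX example in Spark’s “Python Package Management” docs: ship the PEX file with the job and point the workers’ Python at it. A minimal sketch, where `myenv.pex` is an assumed file name for a PEX that bundles the dependencies and keeps the default entry point (a Python interpreter), so it can stand in for `python`:

```python
# Minimal sketch, based on the PEX example in Spark's Python Package Management
# docs. myenv.pex is an assumed dependency-only PEX with the default
# interpreter entry point.
import os

from pyspark.sql import SparkSession

# Executors run Python out of the shipped PEX.
os.environ["PYSPARK_PYTHON"] = "./myenv.pex"

spark = (
    SparkSession.builder
    # Ship the PEX to the executors; on YARN (e.g. EMR) this is
    # 'spark.yarn.dist.files' instead.
    .config("spark.files", "myenv.pex")
    .getOrCreate()
)

# Run a trivial job; executor-side Python comes from the PEX.
spark.range(10).count()
```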
h
Ah yeah, I forgot about that!
s
Ahhh, these are great resources, thank you! I think I still need Docker since I need Python 3.9 for my various requirements, but that shouldn’t be a huge deal.