rich-london-74860
12/01/2022, 4:34 AM
I'm having trouble testing pyspark code that uses a user-defined function (UDF).
I’ve created a small toy repo to replicate this problem:
https://github.com/skyxie/pants-test
The README.md has the reproduction steps, but to break it down:
1. In conftest.py there is a spark fixture that creates a local Spark session.
2. In src/my_module/__init__.py there are 2 UDFs: one that does not use a 3rd-party dependency and one that does (3rd-party dependencies are specified in 3rdparty/requirements.txt). A rough sketch follows below the list.
3. In test/test_my_module.py I have unit tests for each of the 2 UDFs.
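To give a concrete picture, the two UDFs look roughly like this (a minimal sketch; the function names and exact signatures here are made up, the real code is in the repo):

# Illustrative sketch of src/my_module/__init__.py; names are hypothetical.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import tldextract  # 3rd-party dependency from 3rdparty/requirements.txt


@udf(returnType=StringType())
def to_upper(s):
    # No 3rd-party dependency; passes under both pytest and ./pants test.
    return s.upper()


@udf(returnType=StringType())
def extract_domain(url):
    # Uses the 3rd-party tldextract package; this is the one that fails under ./pants test.
    return tldextract.extract(url).domain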
If I run these unit tests using PYTHONPATH=src pytest test/test_my_module.py, then both tests pass, but if I run them using ./pants test ::, then the test for the UDF that uses the 3rd-party dependency fails.
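For context, the Pants targets are the standard ones; roughly something like this (a sketch from memory, the BUILD file locations and target names are illustrative and the actual files are in the repo; dependency inference picks up the imports):

# 3rdparty/BUILD (sketch)
python_requirements(name="reqs", source="requirements.txt")

# src/my_module/BUILD (sketch)
python_sources()

# test/BUILD (sketch)
python_tests(name="tests")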
The error I get locally on macOS is cryptic:
22/11/30 23:14:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (ip-192-168-0-26.ec2.internal executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
But if I run ./pants test :: in a docker container with python and openjdk8 installed, then I get a more interesting error:
22/12/01 04:32:15 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (6f12a8103121 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
... # Truncated for brevity
File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
__import__(name)
ModuleNotFoundError: No module named 'tldextract'
This suggests that the Spark worker attempts to load a pickle, but cannot import a module referenced in the pickle.
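One way to sanity-check this (a hypothetical snippet, not something in the repo) is a UDF that reports which interpreter each worker is actually running, compared against the driver's:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical check: compare the driver's interpreter with the worker's.
spark = SparkSession.builder.master("local").getOrCreate()

@udf(returnType=StringType())
def worker_python(_):
    import sys
    return sys.executable

print("driver:", sys.executable)
spark.createDataFrame([(1,)], ["id"]).select(worker_python("id")).show(truncate=False)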
My hypothesis for why this happens is that the Spark session spawns a worker in a different Python virtual environment from the one Pants is running, and therefore the worker cannot load the dependencies.
bored-energy-25252
12/01/2022, 6:06 AM
rich-london-74860
12/01/2022, 2:15 PM
happy-kitchen-89482
12/01/2022, 6:01 PM
rich-london-74860
03/27/2023, 10:41 PM
busy-vase-39202
03/28/2023, 1:35 PM
happy-kitchen-89482
03/28/2023, 6:24 PM
rich-london-74860
03/28/2023, 6:37 PM
lively-zebra-24587
03/28/2023, 10:11 PM
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
this is the Java version I have locally

import os
import sys
from unittest.mock import patch

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session", name="spark")
def _spark():
    # Point the driver and workers at the interpreter running the tests; the existing
    # environment is spread last, so pre-existing values take precedence on conflict.
    with patch.dict(
        os.environ,
        {
            "PYSPARK_PYTHON": sys.executable,
            "PYSPARK_DRIVER_PYTHON": sys.executable,
            **dict(os.environ),
        },
    ):
        session = (
            SparkSession.builder.master("local")
            .appName("test_databricks")
            .getOrCreate()
        )
        session.sparkContext.setLogLevel("WARN")
        yield session
        session.stop()
happy-kitchen-89482
03/28/2023, 10:57 PM
rich-london-74860
03/28/2023, 11:02 PM
Exception: Python in worker has different version 3.11 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
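For what it's worth, the pinning this error asks for can also be done on the Spark conf directly instead of env vars (a sketch, not from the repo; spark.pyspark.python and spark.pyspark.driver.python are the conf equivalents of PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON):

import sys

from pyspark.sql import SparkSession

# Sketch: point both the driver and the workers at the interpreter running the tests.
spark = (
    SparkSession.builder.master("local")
    .config("spark.pyspark.python", sys.executable)
    .config("spark.pyspark.driver.python", sys.executable)
    .getOrCreate()
)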
lively-zebra-24587
03/29/2023, 12:03 AM
Regarding the tldextract library, which I see in the logs Tian shared at the start of this thread: in my case the Python version was being picked up from pyenv. Once I set it to 3.9.16, everything works fine.
rich-london-74860
03/29/2023, 1:27 PM
So it sounds like if you are using pyspark and pyenv, then it's necessary to set a global default version for pyenv.
I'm not sure if it needs to be 3.9, because my global default is actually 3.8.
bored-energy-25252
03/29/2023, 3:56 PM