# general
r
Hello 👋 I am having a problem running a unit test in pyspark that uses a user-defined function (UDF). I’ve created a small toy repo to replicate this problem: https://github.com/skyxie/pants-test. The README.md has the reproduction steps, but to break it down:
1. In conftest.py there is a spark fixture that creates a local Spark session.
2. In src/my_module/__init__.py there are 2 UDFs, one that does not include a 3rd-party dependency and one that does (see the sketch after this message). 3rd-party dependencies are specified in 3rdparty/requirements.txt.
3. In test/test_my_module.py I have unit tests for each of the 2 UDFs.
If I run these unit tests using PYTHONPATH=src pytest test/test_my_module.py, then both tests pass, but if I run them using ./pants test ::, then the test for the UDF that uses the 3rd-party dependency fails. The error I get on my local macOS machine is cryptic:
22/11/30 23:14:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (ip-192-168-0-26.ec2.internal executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
But if I run ./pants test :: in a docker container with python and openjdk8 installed, then I get a more interesting error:
22/12/01 04:32:15 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (6f12a8103121 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  ... # Truncated for brevity
  File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'tldextract'
which suggests that the Spark worker attempts to load a pickle but cannot import a module referenced in the pickle. My hypothesis is that the Spark session spawns a worker in a different Python virtual environment from the one Pants runs the tests in, and therefore the worker cannot load the dependencies.
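For illustration, a minimal sketch of what the two UDFs could look like; the function names are hypothetical (not the repo’s actual ones) and the third-party one assumes the tldextract package from 3rdparty/requirements.txt:

import tldextract  # 3rd-party dependency listed in 3rdparty/requirements.txt
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def upper_case(value):
    # Uses only built-ins, so it runs in any worker environment.
    return value.upper() if value is not None else None

@udf(returnType=StringType())
def registered_domain(url):
    # References the module-level tldextract import, so the worker must be able
    # to import tldextract when it unpickles and runs the UDF.
    return tldextract.extract(url).registered_domain if url is not None else None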
b
r
That did it, thank you so much! ❤️ You would not believe how much time I spent looking into this 😅
h
Thanks so much Darcy!
We should write a short blog post or tip on the docsite for this!
r
Hi, bumping this issue again because @lively-zebra-24587 is hitting this same issue, and the fix, which works for me on my Intel MacBook, does not work for them on their M1 MacBook. The fix also works in a Docker container. Is there any reason this might work on one architecture but not another? I’m currently in the process of setting up an M1 laptop to replicate, but this will take a while, and I was wondering if there are any known issues.
b
Just an fyi that most of the maintainers are based in North America, so I expect you'll be hearing from someone pretty soon.
❤️ 1
h
Sorry for the trouble! @bored-energy-25252 any ideas?
@rich-london-74860 we'd need a lot more details to debug this.
r
@lively-zebra-24587 Can you chime in here with more details? @happy-kitchen-89482 what do you need to know?
l
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
this is the Java version I have locally
the tests that are failing all use this fixture, with the fix suggested last time
import os
import sys
from unittest.mock import patch

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session", name="spark")
def _spark():
    # Set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON (when not already set) to the
    # interpreter running the tests, so Spark workers use the same environment
    # as the driver.
    with patch.dict(
        os.environ,
        {
            "PYSPARK_PYTHON": sys.executable,
            "PYSPARK_DRIVER_PYTHON": sys.executable,
            **dict(os.environ),
        },
    ):
        session = (
            SparkSession.builder.master("local")
            .appName("test_databricks")
            .getOrCreate()
        )
        session.sparkContext.setLogLevel("WARN")

        yield session

        session.stop()
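For reference, a test that exercises the failing path through this fixture might look like the sketch below; the module path and UDF name are illustrative, not the repo’s actual ones:

from pyspark.sql import Row

from my_module import registered_domain  # illustrative UDF name

def test_registered_domain(spark):
    # collect() forces the UDF to run on a Spark worker, which must be able to
    # import the 3rd-party dependency (tldextract) for this test to pass.
    df = spark.createDataFrame([Row(url="https://www.example.co.uk/page")])
    rows = df.select(registered_domain(df.url).alias("domain")).collect()
    assert rows[0].domain == "example.co.uk"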
h
What is the error you're seeing?
And does this identical thing pass on an x86 Mac?
r
On another Slack, @lively-zebra-24587 shared this with me:
Exception: Python in worker has different version 3.11 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Quite different from the original error that I reported, but I remember seeing Tom’s error as well when I was working through this. The fix works on x86 and in Docker containers, but not on the M1.
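One way to confirm which interpreter the workers actually launch is a small diagnostic like the sketch below (not from the repo; it assumes a working spark fixture): it compares the driver’s Python version and executable with what a worker reports back.

import sys

def _worker_python(_):
    # Runs on a Spark worker and reports that interpreter's version and path.
    import sys as worker_sys
    yield worker_sys.version.split()[0], worker_sys.executable

def check_worker_python(spark):
    driver = (sys.version.split()[0], sys.executable)
    worker = (
        spark.sparkContext.parallelize([0], 1)
        .mapPartitions(_worker_python)
        .collect()[0]
    )
    print("driver:", driver)
    print("worker:", worker)
    # PySpark requires the driver and workers to share the same major.minor version.
    assert driver[0].rsplit(".", 1)[0] == worker[0].rsplit(".", 1)[0], (driver, worker)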
l
Yes, that's right. I noticed that the only tests that fail have a dependency on the tldextract library, which I see in the logs Tian shared at the start of this thread.
This is now solved - the issue was that I did not have a global version set for pyenv. Once I set it to 3.9.16, everything works fine.
👍 1
r
The fact that this problem only happened on M1 laptops was a coincidence. It looks like if you are using pyspark and pyenv, then it’s necessary to set a global default version for pyenv. I’m not sure if it needs to be 3.9 because my global default is actually 3.8.
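A defensive variant (an assumption on my part, not something from the thread) is to fail fast in conftest.py with a readable message when the worker interpreter would differ from the driver’s, instead of waiting for Spark’s version-mismatch error:

import os
import sys

def assert_worker_python_matches_driver():
    # The fixture above patches PYSPARK_PYTHON to sys.executable; if a
    # pre-existing environment variable (e.g. one pointing at a pyenv shim)
    # wins instead, raise a clear error before Spark does.
    worker_python = os.environ.get("PYSPARK_PYTHON", sys.executable)
    if worker_python != sys.executable:
        raise RuntimeError(
            f"PYSPARK_PYTHON={worker_python!r} does not match the driver "
            f"interpreter {sys.executable!r}; Spark workers would run in a "
            "different Python environment."
        )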
b
PySpark works perfectly with Pants, here is my project: https://github.com/komprenilo/liga