rich-london-74860
12/01/2022, 4:34 AM
I'm having trouble testing pyspark code that uses a user-defined function (UDF).
I’ve created a small toy repo to replicate this problem:
https://github.com/skyxie/pants-test
The README.md has the reproduction steps, but to break it down:
1. In conftest.py there is a spark fixture that creates a local Spark session.
2. In src/my_module/__init__.py there are 2 UDFs: one that does not use a 3rd-party dependency and one that does (3rd-party dependencies are specified in 3rdparty/requirements.txt). A rough sketch follows below the list.
3. In test/test_my_module.py I have unit tests for each of the 2 UDFs.
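To give a concrete picture, the two UDFs look roughly like this (a minimal sketch; the function names and exact signatures here are made up, the real code is in the repo):

# Illustrative sketch of src/my_module/__init__.py; names are hypothetical.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import tldextract  # 3rd-party dependency from 3rdparty/requirements.txt


@udf(returnType=StringType())
def to_upper(s):
    # No 3rd-party dependency; passes under both pytest and ./pants test.
    return s.upper()


@udf(returnType=StringType())
def extract_domain(url):
    # Uses the 3rd-party tldextract package; this is the one that fails under ./pants test.
    return tldextract.extract(url).domain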
If I run these unit tests using PYTHONPATH=src pytest test/test_my_module.py, then both tests pass, but if I run them using ./pants test ::, then the test for the UDF that uses the 3rd-party dependency fails.
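For context, the Pants targets are the standard ones; roughly something like this (a sketch from memory, the BUILD file locations and target names are illustrative and the actual files are in the repo; dependency inference picks up the imports):

# 3rdparty/BUILD (sketch)
python_requirements(name="reqs", source="requirements.txt")

# src/my_module/BUILD (sketch)
python_sources()

# test/BUILD (sketch)
python_tests(name="tests")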
The error I get locally on macOS is cryptic:
22/11/30 23:14:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (ip-192-168-0-26.ec2.internal executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
But if I run ./pants test :: in a docker container with python and openjdk8 installed, then I get a more interesting error:
22/12/01 04:32:15 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (6f12a8103121 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
... # Truncated for brevity
File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "/root/.cache/pants/named_caches/pex_root/venvs/ccde778f33dcd85107fc1c69e63d0aece19f2091/1f95d470b7a8fe245d4aa248c6b523b37b0dc73f/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
__import__(name)
ModuleNotFoundError: No module named 'tldextract'
This suggests that the Spark worker attempts to load a pickle, but cannot import a module referenced in the pickle.
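One way to sanity-check this (a hypothetical snippet, not something in the repo) is a UDF that reports which interpreter each worker is actually running, compared against the driver's:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical check: compare the driver's interpreter with the worker's.
spark = SparkSession.builder.master("local").getOrCreate()

@udf(returnType=StringType())
def worker_python(_):
    import sys
    return sys.executable

print("driver:", sys.executable)
spark.createDataFrame([(1,)], ["id"]).select(worker_python("id")).show(truncate=False)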
My hypothesis for why this happens is that the Spark session spawns a worker in a different Python virtual environment from the one Pants is running, and therefore the worker cannot load the dependencies.
bored-energy-25252
12/01/2022, 6:06 AM
rich-london-74860
12/01/2022, 2:15 PM
happy-kitchen-89482
12/01/2022, 6:01 PM
rich-london-74860
03/27/2023, 10:41 PM
busy-vase-39202
03/28/2023, 1:35 PM
happy-kitchen-89482
03/28/2023, 6:24 PM
rich-london-74860
03/28/2023, 6:37 PM
lively-zebra-24587
03/28/2023, 10:11 PM
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
this is the Java version I have locally

import os
import sys
from unittest.mock import patch

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session", name="spark")
def _spark():
    # Point the driver and workers at the interpreter running the tests; the existing
    # environment is spread last, so pre-existing values take precedence on conflict.
    with patch.dict(
        os.environ,
        {
            "PYSPARK_PYTHON": sys.executable,
            "PYSPARK_DRIVER_PYTHON": sys.executable,
            **dict(os.environ),
        },
    ):
        session = (
            SparkSession.builder.master("local")
            .appName("test_databricks")
            .getOrCreate()
        )
        session.sparkContext.setLogLevel("WARN")
        yield session
        session.stop()
happy-kitchen-89482
03/28/2023, 10:57 PM
rich-london-74860
03/28/2023, 11:02 PM
Exception: Python in worker has different version 3.11 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
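For what it's worth, the pinning this error asks for can also be done on the Spark conf directly instead of env vars (a sketch, not from the repo; spark.pyspark.python and spark.pyspark.driver.python are the conf equivalents of PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON):

import sys

from pyspark.sql import SparkSession

# Sketch: point both the driver and the workers at the interpreter running the tests.
spark = (
    SparkSession.builder.master("local")
    .config("spark.pyspark.python", sys.executable)
    .config("spark.pyspark.driver.python", sys.executable)
    .getOrCreate()
)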
lively-zebra-24587
03/29/2023, 12:03 AM
Regarding the tldextract library, which I see in the logs Tian shared at the start of this thread: in my case the Python version was being picked up from pyenv. Once I set it to 3.9.16, everything works fine.
rich-london-74860
03/29/2023, 1:27 PM
So it sounds like if you are using pyspark and pyenv, then it's necessary to set a global default version for pyenv.
I'm not sure if it needs to be 3.9, because my global default is actually 3.8.
bored-energy-25252
03/29/2023, 3:56 PM