# general
Hi. Thanks to @enough-analyst-54434, I was able to resolve my pyspark issue. However, adding a constraints file introduced a new one. Using a constraints file causes
ModuleNotFoundError: No module named 'pyarrow'
when running tests. Repo reproducing the issue: https://github.com/adityav/pants-python-tryouts
Without a constraints file, hellospark_test works correctly:
➜ ./pants test helloworld/sparkjob/hellospark_test.py
19:14:59.81 [INFO] Completed: Building 4 requirements for requirements.pex from the python-default.lock resolve: pandas==1.5.1, pyarrow==6.0.1, pyspark[sql]==3.3.1, pytest==6.2.5
19:15:01.36 [INFO] Completed: Building pytest_runner.pex
19:15:11.84 [INFO] Completed: Run Pytest - helloworld/sparkjob/hellospark_test.py:tests succeeded.

✓ helloworld/sparkjob/hellospark_test.py:tests succeeded in 10.35s.
On adding a constraints file in `pants.toml`:
python-default = "constraints-3.10.txt"
Getting error:
➜ ./pants generate-lockfiles                          
19:17:41.08 [INFO] Initializing scheduler...
19:17:41.40 [INFO] Scheduler initialized.
19:18:01.38 [INFO] Completed: Generate lockfile for python-default
19:18:01.39 [INFO] Wrote lockfile for the resolve `python-default` to python-default.lock
➜ ./pants test helloworld/sparkjob/hellospark_test.py
19:19:25.82 [ERROR] Completed: Run Pytest - helloworld/sparkjob/hellospark_test.py:tests failed (exit code 2).
============================= test session starts ==============================
platform darwin -- Python 3.10.9, pytest-7.0.1, pluggy-1.0.0
rootdir: /private/var/folders/0t/dmh8ynt13pbc2y2stvb0by6c0000gn/T/pants-sandbox-aA4jsU
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 0 items / 1 error

==================================== ERRORS ====================================
___________ ERROR collecting helloworld/sparkjob/hellospark_test.py ____________
ImportError while importing test module '/private/var/folders/0t/dmh8ynt13pbc2y2stvb0by6c0000gn/T/pants-sandbox-aA4jsU/helloworld/sparkjob/hellospark_test.py'.
Hint: make sure your test modules/packages have valid Python names.
/usr/local/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
helloworld/sparkjob/hellospark_test.py:1: in <module>
    from helloworld.sparkjob import hellospark
helloworld/sparkjob/hellospark.py:5: in <module>
    import pyarrow as pa
E   ModuleNotFoundError: No module named 'pyarrow'
- generated xml file: /private/var/folders/0t/dmh8ynt13pbc2y2stvb0by6c0000gn/T/pants-sandbox-aA4jsU/helloworld.sparkjob.hellospark_test.py.tests.xml -
=========================== short test summary info ============================
ERROR helloworld/sparkjob/hellospark_test.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.82s ===============================

✕ helloworld/sparkjob/hellospark_test.py:tests failed in 1.65s.
pyarrow is shown as a dependency, so I'm not sure what's causing it:
➜ ./pants dependencies --transitive helloworld/sparkjob/hellospark_test.py
so I solved it by adding the following to
version = "pytest==6.2.5"
lockfile = "pytest.lock"
I don’t know how pytest and the constraints file interact, but I noticed the tests were using a different pytest version (the test session header above shows pytest-7.0.1), while the constraints file specified 6.2.5.
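A quick way to spot this kind of mismatch is to compare the pins in the constraints file against the versions actually reported at runtime. The following is only a minimal sketch: `parse_pins` and `find_mismatches` are hypothetical helper names, the sample constraints text is illustrative (modelled on the pins in the lockfile log line above, not the real contents of `constraints-3.10.txt`), and only simple `name==version` lines are handled.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name so 'Py_Test' and 'py-test' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(constraints_text: str) -> dict[str, str]:
    """Map normalized project name -> pinned version for `name==version` lines.

    Comments, blank lines, and environment markers are ignored; any line that
    is not an exact `==` pin is skipped.
    """
    pins = {}
    for raw in constraints_text.splitlines():
        line = raw.split("#", 1)[0]          # drop trailing comments
        line = line.split(";", 1)[0].strip()  # drop environment markers
        if not line:
            continue
        match = re.fullmatch(
            r"([A-Za-z0-9._-]+)(?:\[[^\]]*\])?\s*==\s*(\S+)", line
        )
        if match:
            pins[normalize(match.group(1))] = match.group(2)
    return pins


def find_mismatches(pins: dict[str, str], running: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {name: (pinned, running)} where the running version differs from the pin."""
    mismatches = {}
    for name, version in running.items():
        key = normalize(name)
        pinned = pins.get(key)
        if pinned is not None and pinned != version:
            mismatches[key] = (pinned, version)
    return mismatches


# Illustrative input only: not the actual constraints file from the repo.
sample_constraints = """\
pandas==1.5.1
pyarrow==6.0.1
pyspark==3.3.1
pytest==6.2.5
"""

# Version taken from the pytest session header in the failing run.
print(find_mismatches(parse_pins(sample_constraints), {"pytest": "7.0.1"}))
# prints {'pytest': ('6.2.5', '7.0.1')}
```

A non-empty result here would be a hint that the tool version and the constraints file disagree, which is the situation described above.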
Aha, I (handwave) think that makes sense. We run tests by ~ `PEX_PATH=requirements.pex PEX_EXTRA_SYS_PATH=src/ pytest.pex` in a sandbox with all your tests and source under
. The key thing is that the pytest tool PEX and the 3rdparty requirements PEX are separate. Spark does complicated things, and if it loads pytest from one PEX and not the other, I expect that affects what it can see in terms of other dependencies. By aligning the pytest versions, you don't force Spark to look in the wrong PEX for other dependencies. Again - a handwave. There are lots of details there to pin down and prove.
In short, if that is in the ballpark, this is a Pants bug, and it's unfortunately in the long and growing list of bugs due to performance hacks. The only reason we build a tool PEX for pytest separate from your requirements PEX is to save time, and it's at the expense of ~correctness / uniformity / predictability.
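One way to start pinning those details down might be to print where each module would be imported from inside the test sandbox, so you can see which PEX (if any) supplies `pytest`, `pyspark`, and `pyarrow`. This is just a diagnostic sketch using the standard library; `locate` and `report` are hypothetical helper names, and nothing here is specific to how Pants or PEX work internally.

```python
import importlib.util


def locate(module_name: str):
    """Return the path a module would be imported from, or None if it is not importable."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a loaded module has no spec.
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin; fall back to their first search location.
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


def report(module_names):
    """Map each module name to its import location (None when it cannot be found)."""
    return {name: locate(name) for name in module_names}


if __name__ == "__main__":
    # Module names taken from the traceback and lockfile log in this thread.
    for name, origin in report(["pytest", "pyspark", "pyarrow"]).items():
        print(f"{name}: {origin}")
```

Calling `report(...)` from the top of the failing test module (before `import pyarrow`) and running with output capture disabled should show whether `pyarrow` resolves to `None` while `pytest` resolves to a path, which would at least narrow down where the lookup goes wrong.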
man, python packaging is such a mess. whether it be pyspark, tensorflow, pytorch, or ray, there is always some issue.
Sometimes I feel like going back to venv + docker. It won’t solve anything, but there is a greater chance to find a stackoverflow thread on how to solve it 😐
Stack Overflow may get you on your way sometimes, but it also makes us all dumber. It's worth slowing down, digging in, and understanding the ecosystem if you're stuck with it.