crooked-country-1937
01/22/2023, 3:47 PM
./pants run gives ModuleNotFoundError: No module named 'pandas'.
here is a repo with the issue: https://github.com/adityav/pants-python-tryouts
./pants run helloworld/sparkjob/hellospark.py
pandas is in the dependency tree.
./pants dependencies --transitive helloworld/sparkjob/hellospark.py
//:reqs#pandas
//:reqs#pyspark
//requirements.txt:reqs
❯ ./pants run helloworld/sparkjob/hellospark.py
22:02:37.90 [INFO] Initializing scheduler...
22:02:38.22 [INFO] Scheduler initialized.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/22 22:02:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/22 22:02:55 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 16) 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 670, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 507, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 289, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 85, in read_command
command = serializer._read_with_length(file)
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/Users/avishwakarma/.cache/pants/named_caches/pex_root/installed_wheels/878b260bb4d3ee05745c118426887764855ed824d3fd63fe5648c52326d8d32e/pyspark-3.3.1-py2.py3-none-any.whl/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'pandas'
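(Editor's note: the stack trace goes through read_udfs / read_single_udf, so the import is failing inside a Spark Python worker process while it deserializes a UDF, not in the driver. A minimal sketch of a script in that shape is below; the real hellospark.py lives in the linked repo, and the rolling_mean UDF and hello_spark body here are illustrative assumptions only.)

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
import pandas as pd

@udf("double")
def rolling_mean(value: int) -> float:
    # The UDF closure references pandas, so the Spark Python worker that
    # unpickles and executes it must be able to import pandas itself.
    return float(pd.Series([value, value + 1]).mean())

def hello_spark(spark: SparkSession) -> None:
    df = spark.range(10)
    df.select(rolling_mean(df["id"])).show()

if __name__ == "__main__":
    hello_spark(SparkSession.builder.getOrCreate())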
enough-analyst-54434
01/22/2023, 9:46 PM
crooked-country-1937
01/22/2023, 10:10 PM
helloworld/sparkjob/hellospark_test.py
enough-analyst-54434
01/22/2023, 11:23 PM
crooked-country-1937
01/23/2023, 1:54 PM
if __name__ == "__main__":
    os.environ['PYSPARK_PYTHON'] = sys.executable
    os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
    hello_spark(SparkSession.builder.getOrCreate())
I have no idea why it works, but it does.
Unfortunately, I ran into the same issue when I added a constraints file. Going to create another thread for it.
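(Editor's note: a plausible reason the workaround helps, not confirmed in the thread: ./pants run starts the driver with an interpreter that can import the requirements Pants resolved (pandas, pyspark), while Spark launches its Python worker processes with whatever PYSPARK_PYTHON points at, typically a bare python3 that has no pandas. Exporting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON as sys.executable before the SparkSession is created makes the workers reuse the driver's interpreter. A minimal standalone sketch of that ordering, with illustrative names only:)

import os
import sys

from pyspark.sql import SparkSession

# Set these before the SparkSession (and its underlying SparkContext) is
# created; the interpreter used for Python workers is chosen at that point.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.getOrCreate()
# UDFs evaluated through this session now run under the same interpreter as
# the driver, so requirements resolved by Pants (e.g. pandas) are importable
# in the worker processes as well.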