spark-user mailing list archives

From Jonas Shomorony <js...@stanford.edu>
Subject Python Dependencies Issue on EMR
Date Fri, 14 Sep 2018 02:08:38 GMT
Hey everyone,


I am currently trying to run a Python Spark job (using YARN client mode)
that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that,
I create a dependencies.zip file that contains all of the
dependencies/libraries (installed through pip) for the job to run
successfully, such as pandas, scipy, tqdm, psycopg2, etc. The
dependencies.zip file sits inside a parent directory (let’s call
it “project”) that holds all the code for my Spark job. I then zip up
everything within project (including dependencies.zip) into project.zip.
Then, I call spark-submit on the master node in my EMR cluster as follows:


`spark-submit --packages … --py-files project.zip --jars ... run_command.py`
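
For reference, the packaging steps described above amount to roughly the following sketch. The helper name `zip_dir_contents` and the paths in the usage comments are hypothetical and used only for illustration; the archives could equally be built with a command-line zip tool.

```python
import os
import zipfile


def zip_dir_contents(src_dir, zip_path):
    """Zip everything under src_dir, storing paths relative to src_dir."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Skip the archive itself if it is being written inside src_dir.
                if os.path.abspath(full_path) == zip_abs:
                    continue
                zf.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_path


# Hypothetical usage mirroring the two steps (paths are placeholders):
# zip_dir_contents("/path/to/site-packages", "project/dependencies.zip")
# zip_dir_contents("project", "project.zip")
```

Storing paths relative to the source directory keeps each package at the top level of the archive, which is what lets `import pandas` resolve once the zip is on the Python path.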


Within “run_command.py” I add dependencies.zip as follows:

`self.spark.sparkContext.addPyFile("dependencies.zip")`


run_command.py then uses other files within project.zip to complete the
Spark job, and those files import various libraries (found in
dependencies.zip).


I am running into a strange issue where all of the libraries are imported
correctly (with no problems) with the exception of scipy and pandas.


For scipy I get the following error:


`File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in <module>
  File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line 1, in <module>
ImportError: cannot import name _ccallback_c`


And for pandas I get this error message:


`File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in <module>
ImportError: C extension: No module named tslib not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.`


When I comment out the imports for these two libraries (and their use from
within the code) everything works fine.


Surprisingly, when I run the application locally (on the master node) without
passing in dependencies.zip, it picks up the libraries from
site-packages correctly and runs to completion. Note that
dependencies.zip is created by zipping the contents of site-packages.
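
Both failing imports involve compiled C extensions (`_ccallback_c` for scipy, `tslib` for pandas), and Python's zipimport mechanism does not load compiled extension modules from inside a zip archive. One check that may help narrow this down is listing such files in dependencies.zip. This is only a minimal sketch: the function name `list_native_extensions` and the suffix list are assumptions for illustration, not part of my actual job.

```python
import zipfile

# Suffixes commonly used for compiled extension modules (an assumed list).
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib")


def list_native_extensions(zip_path):
    """Return archive members that look like compiled extension modules."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(n for n in zf.namelist() if n.endswith(NATIVE_SUFFIXES))


# Hypothetical usage:
# for member in list_native_extensions("dependencies.zip"):
#     print(member)
```

If the pandas and scipy entries show up in that list while the libraries that import fine have none, that would point at the zip packaging rather than at the Spark configuration.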


Does anyone have any ideas as to what may be happening here? I would really
appreciate it.


pip version: 18.0

spark version: 2.3.1

python version: 2.7


Thank you,


Jonas
