spark-user mailing list archives

From Jonas Shomorony <>
Subject Python Dependencies Issue on EMR
Date Fri, 14 Sep 2018 02:08:38 GMT
Hey everyone,

I am currently trying to run a Python Spark job (using YARN client mode)
that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that,
I create a file that contains all of the
dependencies/libraries (installed through pip) for the job to run
successfully, such as pandas, scipy, tqdm, psycopg2, etc. The file is
contained within an outside directory (let’s call it “project”) that contains
all the code to run my Spark job. I then zip up everything within project
(including into

Then, I call spark-submit on the master node in my EMR cluster as follows:

`spark-submit --packages … --py-files --jars ...`
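As a rough illustration of the zipping step described above, here is a minimal sketch in Python. The function name `zip_directory` and both path arguments are hypothetical, not the names used in my actual project; the only assumption is that the contents of a directory are stored at the root of the archive.

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Zip the contents of src_dir so its top-level entries sit at the archive root.

    Hypothetical helper for illustration; zip_path should live outside src_dir
    so the archive does not end up containing itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir so packages are importable
                # from the root of the zip.
                zf.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_path
```

Calling it with a site-packages directory as `src_dir` would produce an archive with each package directory at the top level of the zip.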

Within “” I add as follows:


The then uses other files within to complete the
spark job, and within those files, I import various libraries (found in

I am running into a strange issue where all of the libraries are imported
correctly (with no problems) with the exception of scipy and pandas.

For scipy I get the following error:

`File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/", line 119, in

  File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/", line
1, in <module>

ImportError: cannot import name _ccallback_c`

And for pandas I get this error message:

`File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/", line 35, in

ImportError: C extension: No module named tslib not built. If you want to
import pandas from the source directory, you may need to run 'python build_ext --inplace --force' to build the C extensions first.`

When I comment out the imports for these two libraries (and their use from
within the code) everything works fine.
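One check that may help narrow this down: both failing imports involve compiled C extensions, and Python's zipimport mechanism cannot load compiled extension modules (such as `.so` files) from inside a zip archive. The sketch below lists the compiled members of an archive. The function name `compiled_members` and the suffix list are assumptions for illustration, and this is a diagnostic aid rather than a confirmed explanation of the errors.

```python
import zipfile

# Suffixes commonly used for compiled extension modules (assumed list).
COMPILED_SUFFIXES = (".so", ".pyd", ".dylib")


def compiled_members(zip_path):
    """Return the sorted names of archive members that look like compiled extensions."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            name for name in zf.namelist() if name.endswith(COMPILED_SUFFIXES)
        )
```

If this returns entries under scipy or pandas, those modules would not be importable directly from the zip, while pure-Python libraries in the same archive would still import normally.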

Surprisingly, when I run the application locally (on the master node) without
passing in the dependencies file, it picks up and resolves the libraries from
site-packages correctly and runs to completion. The dependencies file is
created by zipping the contents of site-packages.

Does anyone have any ideas as to what may be happening here? I would really
appreciate it.

pip version: 18.0

spark version: 2.3.1

python version: 2.7

Thank you,

