spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: Python Dependencies Issue on EMR
Date Fri, 14 Sep 2018 15:02:12 GMT
You didn't say how you're zipping the dependencies, but I'm guessing you
either include .egg files or zipped up a virtualenv. In either case, the
extra C stuff that scipy and pandas rely upon doesn't get included.

An approach like this solved the last problem I had that seemed like this -
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html

On Thu, Sep 13, 2018 at 10:08 PM, Jonas Shomorony <jshom@stanford.edu>
wrote:

> Hey everyone,
>
>
> I am currently trying to run a Python Spark job (using YARN client mode)
> that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that,
> I create a dependencies.zip file that contains all of the
> dependencies/libraries (installed through pip) for the job to run
> successfully, such as pandas, scipy, tqdm, psycopg2, etc. The
> dependencies.zip file is contained within an outside directory (let’s call
> it “project”) that contains all the code to run my Spark job. I then zip up
> everything within project (including dependencies.zip) into project.zip.
> Then, I call spark-submit on the master node in my EMR cluster as follows:
>
>
> `spark-submit --packages … --py-files project.zip --jars ...
> run_command.py`
>
>
> Within “run_command.py” I add dependencies.zip as follows:
>
> `self.spark.sparkContext.addPyFile("dependencies.zip”)`
>
>
> The run_command.py then uses other files within project.zip to complete
> the spark job, and within those files, I import various libraries (found in
> dependencies.zip).
>
>
> I am running into a strange issue where all of the libraries are imported
> correctly (with no problems) with the exception of scipy and pandas.
>
>
> For scipy I get the following error:
>
>
> `File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in
> <module>
>
>   File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line
> 1, in <module>
>
> ImportError: cannot import name _ccallback_c`
>
>
> And for pandas I get this error message:
>
>
> `File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35,
> in <module>
>
> ImportError: C extension: No module named tslib not built. If you want to
> import pandas from the source directory, you may need to run 'python
> setup.py build_ext --inplace --force' to build the C extensions first.`
>
>
> When I comment out the imports for these two libraries (and their use from
> within the code) everything works fine.
>
>
> Surprisingly, when I run the application locally (on master node) without
> passing in dependencies.zip, it picks and resolves the libraries from
> site-packages correctly and successfully runs to completion.
> dependencies.zip is created by zipping the contents of site-packages.
>
>
> Does anyone have any ideas as to what may be happening here? I would
> really appreciate it.
>
>
> pip version: 18.0
>
> spark version: 2.3.1
>
> python version: 2.7
>
>
> Thank you,
>
>
> Jonas
>
>

Mime
View raw message