spark-user mailing list archives

From Jonas Shomorony <js...@stanford.edu>
Subject Re: Python Dependencies Issue on EMR
Date Fri, 21 Sep 2018 00:21:27 GMT
Thanks Patrick. Using a conda virtual environment did help with libraries
that required the extra C stuff.

Jonas
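
For readers following along, here is a minimal sketch of what submitting with a packed conda environment might look like. It is not necessarily what either poster ran. The archive name `conda_env.tar.gz`, the alias `environment`, and the helper `build_conda_submit` are hypothetical. `--archives` and the `PYSPARK_PYTHON` / `PYSPARK_DRIVER_PYTHON` variables are standard Spark-on-YARN settings, but check them against your Spark and EMR versions.

```python
import os


def build_conda_submit(env_archive, app_script, alias="environment"):
    """Build argv and environment for spark-submit with a packed conda env.

    env_archive, app_script and alias are hypothetical placeholders.
    """
    argv = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        # YARN unpacks the archive into each container's working
        # directory under the name given after '#'.
        "--archives", "{}#{}".format(env_archive, alias),
        app_script,
    ]
    env = dict(os.environ)
    # Executors use the interpreter inside the unpacked archive.
    env["PYSPARK_PYTHON"] = "./{}/bin/python".format(alias)
    # In client mode the driver runs locally and needs a local interpreter.
    env["PYSPARK_DRIVER_PYTHON"] = "python"
    return argv, env
```

The returned pair could then be handed to `subprocess.call(argv, env=env)` on the master node. Because the whole environment is shipped, compiled libraries such as scipy and pandas travel with their native parts.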

On Fri, Sep 14, 2018 at 8:02 AM Patrick McCarthy <pmccarthy@dstillery.com>
wrote:

> You didn't say how you're zipping the dependencies, but I'm guessing you
> either include .egg files or zipped up a virtualenv. In either case, the
> extra C stuff that scipy and pandas rely upon doesn't get included.
>
> An approach like this solved the last problem I had that looked like this
> one:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>
> On Thu, Sep 13, 2018 at 10:08 PM, Jonas Shomorony <jshom@stanford.edu>
> wrote:
>
>> Hey everyone,
>>
>>
>> I am currently trying to run a Python Spark job (using YARN client mode)
>> that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that,
>> I create a dependencies.zip file that contains all of the
>> dependencies/libraries (installed through pip) for the job to run
>> successfully, such as pandas, scipy, tqdm, psycopg2, etc. The
>> dependencies.zip file is contained within an outside directory (let’s call
>> it “project”) that contains all the code to run my Spark job. I then zip up
>> everything within project (including dependencies.zip) into project.zip.
>> Then, I call spark-submit on the master node in my EMR cluster as follows:
>>
>>
>> `spark-submit --packages … --py-files project.zip --jars ...
>> run_command.py`
>>
>>
>> Within “run_command.py” I add dependencies.zip as follows:
>>
>> `self.spark.sparkContext.addPyFile("dependencies.zip")`
>>
>>
>> The run_command.py then uses other files within project.zip to complete
>> the spark job, and within those files, I import various libraries (found in
>> dependencies.zip).
>>
>>
>> I am running into a strange issue where all of the libraries are imported
>> correctly (with no problems) with the exception of scipy and pandas.
>>
>>
>> For scipy I get the following error:
>>
>>
>> `File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in <module>
>>   File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line 1, in <module>
>> ImportError: cannot import name _ccallback_c`
>>
>>
>> And for pandas I get this error message:
>>
>>
>> `File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in <module>
>> ImportError: C extension: No module named tslib not built. If you want to
>> import pandas from the source directory, you may need to run 'python
>> setup.py build_ext --inplace --force' to build the C extensions first.`
>>
>>
>> When I comment out the imports for these two libraries (and their use
>> from within the code) everything works fine.
>>
>>
>> Surprisingly, when I run the application locally (on master node) without
>> passing in dependencies.zip, it picks and resolves the libraries from
>> site-packages correctly and successfully runs to completion.
>> dependencies.zip is created by zipping the contents of site-packages.
>>
>>
>> Does anyone have any ideas as to what may be happening here? I would
>> really appreciate it.
>>
>>
>> pip version: 18.0
>>
>> spark version: 2.3.1
>>
>> python version: 2.7
>>
>>
>> Thank you,
>>
>>
>> Jonas
>>
>>
>
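
A likely reason only scipy and pandas fail is that Python's zipimport mechanism cannot load compiled extension modules (`.so`, `.pyd`) from inside a zip archive, while pure-Python packages such as tqdm import fine. A minimal sketch for checking which members of an archive are affected, assuming a zip built from site-packages as described in the original message (the helper name `find_native_extensions` is hypothetical):

```python
import zipfile

# File suffixes of compiled extension modules; zipimport cannot load these.
NATIVE_SUFFIXES = (".so", ".pyd")


def find_native_extensions(zip_path):
    """Return the sorted archive members that are compiled extension modules."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(NATIVE_SUFFIXES)
        )
```

Calling `find_native_extensions("dependencies.zip")` lists the files that will not be importable when the zip is added with `addPyFile`. A non-empty result suggests shipping a full environment instead of a zip.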
