spark-user mailing list archives

From "Apostolos N. Papadopoulos" <papad...@csd.auth.gr>
Subject Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed
Date Thu, 04 Oct 2018 09:51:56 GMT
Maybe this can help.

https://stackoverflow.com/questions/32959723/set-python-path-for-spark-worker
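
One way to attack the PYTHONPATH problem directly is to ship the 2.3.2
pyspark.zip and py4j zip yourself and point the workers at them. A rough,
untested sketch (the hdfs:///lib/... paths are placeholders for wherever you
upload the zips, and PYTHONPATH precedence can still depend on your cluster's
spark-env.sh):

    import pyspark

    conf = pyspark.SparkConf()
    conf.setAll([
        # ship both 2.3.2 zips into every YARN container's working directory
        ('spark.yarn.dist.files',
         'hdfs:///lib/pyspark.zip,hdfs:///lib/py4j-0.10.7-src.zip'),
        # relative PYTHONPATH entries resolve against the container working
        # directory, so the shipped zips are found instead of the system install
        ('spark.executorEnv.PYTHONPATH',
         'pyspark.zip:py4j-0.10.7-src.zip'),
        ('spark.yarn.appMasterEnv.PYTHONPATH',
         'pyspark.zip:py4j-0.10.7-src.zip'),
    ])

    sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", conf=conf)

Alternatively, if the 2.3.2 distribution is unpacked at the same local path on
every worker, pointing spark.executorEnv.PYTHONPATH at that installation's
python/ directory and its python/lib/py4j-0.10.7-src.zip should have the same
effect.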



On 04/10/2018 12:19 PM, Jianshi Huang wrote:
> Hi,
>
> I have a problem using multiple versions of PySpark on YARN: the 
> driver and worker nodes all have Spark 2.2.1 preinstalled for 
> production tasks, but I want to use 2.3.2 for my personal EDA.
>
> I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(); 
> however, on the worker nodes the PYTHONPATH still points to the 
> system SPARK_HOME.
>
> Does anyone know how to override the PYTHONPATH on worker nodes?
>
> Here's the error message:
>
>
>     Py4JJavaError: An error occurred while calling o75.collectToPython.
>     : org.apache.spark.SparkException: Job aborted due to stage
>     failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
>     Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492,
>     executor 2): org.apache.spark.SparkException:
>     Error from python worker:
>     Traceback (most recent call last):
>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183,
>     in _run_module_as_main
>     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109,
>     in _get_module_details
>     __import__(pkg_name)
>     File
>     "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py",
>     line 46, in <module>
>     File
>     "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py",
>     line 29, in <module>
>     ModuleNotFoundError: No module named 'py4j'
>     PYTHONPATH was:
>     /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>
>
> And here's how I started the PySpark session in Jupyter:
>
>
>     %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>     %env PYSPARK_PYTHON=/usr/bin/python3
>     import findspark
>     findspark.init()
>     import pyspark
>     sparkConf = pyspark.SparkConf()
>     sparkConf.setAll([
>         ('spark.cores.max', '96'),
>         ('spark.driver.memory', '2g'),
>         ('spark.executor.cores', '4'),
>         ('spark.executor.instances', '2'),
>         ('spark.executor.memory', '4g'),
>         ('spark.network.timeout', '800'),
>         ('spark.scheduler.mode', 'FAIR'),
>         ('spark.shuffle.service.enabled', 'true'),
>         ('spark.dynamicAllocation.enabled', 'true'),
>     ])
>     py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>     sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
>                               conf=sparkConf, pyFiles=py_files)
>
>
>
> Thanks,
> -- 
> Jianshi Huang
>

-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol

