spark-user mailing list archives

From Marcelo Vanzin <van...@cloudera.com.INVALID>
Subject Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed
Date Thu, 04 Oct 2018 16:10:48 GMT
Normally the version of Spark installed on the cluster does not
matter, since Spark is uploaded from your gateway machine to YARN by
default.

You probably have some configuration (in spark-defaults.conf) that
tells YARN to use a cached copy. Get rid of that configuration, and
you can use whatever version you like.
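
For example, a spark-defaults.conf entry along these lines (spark.yarn.archive and
spark.yarn.jars are the standard YARN staging properties; the paths here are only
illustrative, not copied from your cluster) would pin every job to a pre-staged
2.2.1 build:

  spark.yarn.archive   hdfs:///apps/spark/spark-2.2.1-archive.zip
  # or
  spark.yarn.jars      hdfs:///apps/spark/2.2.1/jars/*

With neither set, spark-submit uploads the jars from whatever SPARK_HOME you
launch from.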
On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.huang@gmail.com> wrote:
>
> Hi,
>
> I have a problem using multiple versions of PySpark on YARN. The driver and worker
> nodes all have Spark 2.2.1 preinstalled for production tasks, and I want to use 2.3.2
> for my personal EDA.
>
> I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(); however, on the
> worker nodes the PYTHONPATH still points to the system SPARK_HOME.
>
> Does anyone know how to override the PYTHONPATH on worker nodes?
>
> Here's the error message:
>>
>>
>> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
>> Error from python worker:
>> Traceback (most recent call last):
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>> __import__(pkg_name)
>> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46,
in <module>
>> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29,
in <module>
>> ModuleNotFoundError: No module named 'py4j'
>> PYTHONPATH was:
>> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>
>
> And here's how I started the PySpark session in Jupyter:
>>
>>
>> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>> %env PYSPARK_PYTHON=/usr/bin/python3
>> import findspark
>> findspark.init()
>> import pyspark
>> sparkConf = pyspark.SparkConf()
>> sparkConf.setAll([
>>     ('spark.cores.max', '96')
>>     ,('spark.driver.memory', '2g')
>>     ,('spark.executor.cores', '4')
>>     ,('spark.executor.instances', '2')
>>     ,('spark.executor.memory', '4g')
>>     ,('spark.network.timeout', '800')
>>     ,('spark.scheduler.mode', 'FAIR')
>>     ,('spark.shuffle.service.enabled', 'true')
>>     ,('spark.dynamicAllocation.enabled', 'true')
>> ])
>> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", conf=sparkConf, pyFiles=py_files)
>>
>
>
> Thanks,
> --
> Jianshi Huang
>
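
If you also want to pin the Python side explicitly, the usual knobs are the pyFiles
list plus the spark.executorEnv.* / spark.yarn.appMasterEnv.* properties. A minimal
sketch, untested against your cluster (the paths assume the 2.3.2 layout from your
message, and whether these entries win over the preinstalled copies depends on how
the cluster assembles PYTHONPATH):

import glob
import os

import pyspark

# 2.3.2 install on the gateway (path taken from the message above).
spark_home = "/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7"
py_lib = os.path.join(spark_home, "python", "lib")

# Ship the matching pyspark.zip and its bundled py4j zip, not just py4j.
py_files = [os.path.join(py_lib, "pyspark.zip")] + \
           glob.glob(os.path.join(py_lib, "py4j-*.zip"))

# Point the executors and the YARN AM at those libraries as well; this only
# helps if the same path exists on the worker nodes.
worker_pythonpath = ":".join(
    [os.path.join(py_lib, "pyspark.zip"),
     os.path.join(py_lib, "py4j-0.10.7-src.zip")])

conf = pyspark.SparkConf()
conf.set("spark.executorEnv.PYTHONPATH", worker_pythonpath)
conf.set("spark.yarn.appMasterEnv.PYTHONPATH", worker_pythonpath)

sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
                          conf=conf, pyFiles=py_files)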


-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

