spark-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed
Date Fri, 05 Oct 2018 04:53:52 GMT
Hi Marcelo,

I see what you mean. I tried it but still got the same error message.

Error from python worker:
>   Traceback (most recent call last):
>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>       __import__(pkg_name)
>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
>   ModuleNotFoundError: No module named 'py4j'
> PYTHONPATH was:
>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk3/yarn/usercache/jianshi.huang/filecache/134/__spark_libs__8468485589501316413.zip/spark-core_2.11-2.3.2.jar
>
>
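
A possible way to narrow this down, as a minimal sketch rather than anything confirmed in this thread: split the PYTHONPATH reported above and check which entries actually exist on the worker node. My assumption is that the py4j zip named there may simply be missing under the worker's pre-installed SPARK_HOME. The helper name missing_pythonpath_entries is hypothetical, and only the Python standard library is used.

```python
import os


def missing_pythonpath_entries(pythonpath, exists=os.path.exists):
    """Return the entries of a PYTHONPATH string that are not present locally.

    `exists` is injectable so the check can be exercised without touching
    the real filesystem; by default it is os.path.exists.
    """
    # Split on the platform path separator (':' on Linux) and drop empty parts.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return [p for p in entries if not exists(p)]


if __name__ == "__main__":
    # Intended to be run on a worker node. Uses this process's PYTHONPATH;
    # paste the string from the error message instead to test that exact value.
    for path in missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")):
        print("missing:", path)
```

If an entry such as the py4j zip shows up as missing, that would be consistent with the ModuleNotFoundError above, although it would not by itself explain why the worker's SPARK_HOME is being used.
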
On Fri, Oct 5, 2018 at 1:25 AM Marcelo Vanzin <vanzin@cloudera.com> wrote:

> Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
> expanded by the shell).
>
> But it's really weird to be setting SPARK_HOME in the environment of
> your node managers. YARN shouldn't need to know about that.
> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
> >
> >
> https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
> >
> > The code shows Spark will try to find the path if SPARK_HOME is
> > specified. And on my worker node, SPARK_HOME is specified in .bashrc, for
> > the pre-installed 2.2.1 path.
> >
> > I don't want to make any changes to the worker node configuration, so is
> > there any way to override the order?
> >
> > Jianshi
> >
> > On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin <vanzin@cloudera.com>
> wrote:
> >>
> >> Normally the version of Spark installed on the cluster does not
> >> matter, since Spark is uploaded from your gateway machine to YARN by
> >> default.
> >>
> >> You probably have some configuration (in spark-defaults.conf) that
> >> tells YARN to use a cached copy. Get rid of that configuration, and
> >> you can use whatever version you like.
> >> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have a problem using multiple versions of PySpark on YARN. The
> >> > driver and worker nodes are all preinstalled with Spark 2.2.1 for
> >> > production tasks, and I want to use 2.3.2 for my personal EDA.
> >> >
> >> > I've tried both the 'pyFiles=' option and sparkContext.addPyFile();
> >> > however, on the worker node, the PYTHONPATH still uses the system SPARK_HOME.
> >> >
> >> > Does anyone know how to override the PYTHONPATH on worker nodes?
> >> >
> >> > Here's the error message,
> >> >>
> >> >>
> >> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
> >> >> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
> >> >> Error from python worker:
> >> >> Traceback (most recent call last):
> >> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
> >> >> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> >> >> File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
> >> >> __import__(pkg_name)
> >> >> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
> >> >> File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
> >> >> ModuleNotFoundError: No module named 'py4j'
> >> >> PYTHONPATH was:
> >> >> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
> >> >
> >> >
> >> > And here's how I started Pyspark session in Jupyter.
> >> >>
> >> >>
> >> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
> >> >> %env PYSPARK_PYTHON=/usr/bin/python3
> >> >> import findspark
> >> >> findspark.init()
> >> >> import pyspark
> >> >> sparkConf = pyspark.SparkConf()
> >> >> sparkConf.setAll([
> >> >>     ('spark.cores.max', '96')
> >> >>     ,('spark.driver.memory', '2g')
> >> >>     ,('spark.executor.cores', '4')
> >> >>     ,('spark.executor.instances', '2')
> >> >>     ,('spark.executor.memory', '4g')
> >> >>     ,('spark.network.timeout', '800')
> >> >>     ,('spark.scheduler.mode', 'FAIR')
> >> >>     ,('spark.shuffle.service.enabled', 'true')
> >> >>     ,('spark.dynamicAllocation.enabled', 'true')
> >> >> ])
> >> >> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
> >> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", conf=sparkConf, pyFiles=py_files)
> >> >>
> >> >
> >> >
> >> > Thanks,
> >> > --
> >> > Jianshi Huang
> >> >
> >>
> >>
> >> --
> >> Marcelo
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
>
>
>
> --
> Marcelo
>


-- 
Jianshi Huang
