spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed
Date Thu, 04 Oct 2018 19:34:59 GMT
Hi Marcelo,
it would be great if you could illustrate what you mean; I would be
interested to know.

Hi Jianshi,
so, just to be sure: you want to work on Spark 2.3 while having Spark 2.1
installed on your cluster?

Regards,
Gourav Sengupta

On Thu, Oct 4, 2018 at 6:26 PM Marcelo Vanzin <vanzin@cloudera.com.invalid>
wrote:

> Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
> expanded by the shell).
>
> But it's really weird to be setting SPARK_HOME in the environment of
> your node managers. YARN shouldn't need to know about that.
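To make the suggestion above concrete, here is a minimal sketch, assuming the setting is passed on a spark-submit command line. The helper name `executor_env_conf_args` is hypothetical; only the `spark.executorEnv.[Name]` property prefix comes from Spark itself. Note that in a POSIX shell it is single quotes that keep `$PWD` from being expanded on the gateway machine, so the literal string reaches the executor side.

```python
import shlex


def executor_env_conf_args(env):
    """Build spark-submit --conf arguments that set executor environment variables.

    Hypothetical helper: each NAME/value pair becomes
    spark.executorEnv.NAME=value.
    """
    args = []
    for name, value in env.items():
        args += ["--conf", f"spark.executorEnv.{name}={value}"]
    return args


# "$PWD" is kept as a literal string here; Python does not expand it.
args = executor_env_conf_args({"SPARK_HOME": "$PWD"})

# shlex.quote wraps the argument in single quotes because it contains "$".
command_line = " ".join(shlex.quote(a) for a in ["spark-submit"] + args)
print(command_line)
# prints: spark-submit --conf 'spark.executorEnv.SPARK_HOME=$PWD'
```

When the session is created from Python instead, as in the Jupyter setup quoted further down, the same pair can be added to the configuration list as `('spark.executorEnv.SPARK_HOME', '$PWD')`, since no shell is involved there. Whether this actually overrides the value from .bashrc on the workers is something to verify on the cluster.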
> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
> >
> > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
> >
> > The code shows that Spark will try to find the path if SPARK_HOME is
> > specified. On my worker nodes, SPARK_HOME is set in .bashrc to the
> > pre-installed 2.2.1 path.
> >
> > I don't want to make any changes to the worker node configuration, so is
> > there any way to override the order?
> >
> > Jianshi
> >
> > On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin <vanzin@cloudera.com> wrote:
> >>
> >> Normally the version of Spark installed on the cluster does not
> >> matter, since Spark is uploaded from your gateway machine to YARN by
> >> default.
> >>
> >> You probably have some configuration (in spark-defaults.conf) that
> >> tells YARN to use a cached copy. Get rid of that configuration, and
> >> you can use whatever version you like.
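One way to look for such a setting is sketched below. It assumes the cached copy is configured through `spark.yarn.archive` or `spark.yarn.jars`, which are the usual properties for this but may not be the ones in play on this cluster. The function name and the sample paths are made up for illustration.

```python
import re

# Properties that point YARN at a pre-staged copy of the Spark jars.
# Assumption: these are the settings meant by "cached copy" above.
CACHED_COPY_KEYS = ("spark.yarn.archive", "spark.yarn.jars")


def find_cached_copy_settings(conf_text):
    """Return {key: value} for cached-copy properties in spark-defaults.conf text."""
    found = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Key and value are separated by whitespace and/or "=".
        parts = re.split(r"[=\s]+", line, maxsplit=1)
        if len(parts) == 2 and parts[0] in CACHED_COPY_KEYS:
            found[parts[0]] = parts[1]
    return found


# Hypothetical file contents, for illustration only.
sample = """\
# cluster defaults
spark.master yarn
spark.yarn.archive hdfs:///spark/archive.zip
spark.yarn.jars=hdfs:///jars/*.jar
"""
print(find_cached_copy_settings(sample))
```

On a real gateway machine the text would typically come from `$SPARK_HOME/conf/spark-defaults.conf`; an empty result suggests the jars are uploaded on each submission.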
> >> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.huang@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have a problem using multiple versions of PySpark on YARN. The
> >> > driver and worker nodes are all preinstalled with Spark 2.2.1 for
> >> > production tasks, and I want to use 2.3.2 for my personal EDA.
> >> >
> >> > I've tried both the 'pyFiles=' option and SparkContext.addPyFile();
> >> > however, on the worker nodes the PYTHONPATH still uses the system
> >> > SPARK_HOME.
> >> >
> >> > Does anyone know how to override the PYTHONPATH on worker nodes?
> >> >
> >> > Here's the error message:
> >> >>
> >> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
> >> >> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
> >> >> Error from python worker:
> >> >>   Traceback (most recent call last):
> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
> >> >>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
> >> >>       __import__(pkg_name)
> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
> >> >>   ModuleNotFoundError: No module named 'py4j'
> >> >> PYTHONPATH was:
> >> >>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
> >> >
> >> >
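A quick way to narrow down an error like the one above is to check which PYTHONPATH entries actually exist on the worker. Python silently skips path entries that are missing, so a py4j zip that is listed but absent would produce exactly this kind of ModuleNotFoundError. The sketch below is only a diagnostic aid; the function name is hypothetical and it has to run on the worker node itself to be meaningful.

```python
import os


def missing_pythonpath_entries(pythonpath, sep=":"):
    """Return the PYTHONPATH entries that do not exist on this machine."""
    entries = [p for p in pythonpath.split(sep) if p]
    return [p for p in entries if not os.path.exists(p)]


# Check the PYTHONPATH of the current process; paste the value from the
# error message instead to check the path the Python worker was given.
for entry in missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")):
    print("missing:", entry)
```

If the py4j-0.10.7-src.zip entry under /usr/lib/spark-current shows up as missing, that would be consistent with the path being built from the pre-installed SPARK_HOME while the py4j file name comes from the 2.3.2 jars.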
> >> > And here's how I started the PySpark session in Jupyter:
> >> >>
> >> >>
> >> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
> >> >> %env PYSPARK_PYTHON=/usr/bin/python3
> >> >> import findspark
> >> >> findspark.init()
> >> >> import pyspark
> >> >> sparkConf = pyspark.SparkConf()
> >> >> sparkConf.setAll([
> >> >>     ('spark.cores.max', '96')
> >> >>     ,('spark.driver.memory', '2g')
> >> >>     ,('spark.executor.cores', '4')
> >> >>     ,('spark.executor.instances', '2')
> >> >>     ,('spark.executor.memory', '4g')
> >> >>     ,('spark.network.timeout', '800')
> >> >>     ,('spark.scheduler.mode', 'FAIR')
> >> >>     ,('spark.shuffle.service.enabled', 'true')
> >> >>     ,('spark.dynamicAllocation.enabled', 'true')
> >> >> ])
> >> >> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
> >> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
> >> >>                           conf=sparkConf, pyFiles=py_files)
> >> >>
> >> >
> >> >
> >> > Thanks,
> >> > --
> >> > Jianshi Huang
> >> >
> >>
> >>
> >> --
> >> Marcelo
> >
> >
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
>
>
>
> --
> Marcelo
>
