spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Rosen <rosenvi...@gmail.com>
Subject Re: pyspark crash on mesos
Date Tue, 04 Mar 2014 00:53:17 GMT
Brad and I looked into this error and I have a few hunches about what might
be happening.

We didn't observe any failed tasks in the logs.  For some reason, the
Python driver is failing to acknowledge an accumulator update from a
successfully-completed task.  Our program doesn't explicitly use
accumulators, but it looks like PySpark task results always contain a
single Java accumulator with a PysparkAccumulatorParam.

DAGScheduler doesn't catch the exception thrown by the failed addInPlace
accumulator operation, which results in the entire DAGScheduler crashing
and the job freezing.

We should try to identify the root cause of the unacknowledged accumulator
updates, but in the meantime it would be a good idea to add more exception
handling to ensure that the DAGScheduler's main loop never crashes.  This
might mask the presence of bugs like this accumulator issue, but it would
prevent rare bugs from taking out the entire SparkContext (this is
especially important for job servers that share a SparkContext across
multiple jobs).  A general fix might be to add a "catch Exception" block in
handleTaskCompletion that turns uncaught exceptions into task failures, and
possibly a top-level "catch Exception" block in DAGScheduler.run() that
causes any uncaught exceptions to immediately cancel/crash all running jobs
(so we fail-fast instead of hanging).

It would be nice if there was a way to selectively enable Java-style
checked exceptions to avoid introducing these types of failure-handling
bugs.


On Mon, Mar 3, 2014 at 10:34 AM, Brad Miller <bmiller1@eecs.berkeley.edu>wrote:

> Hi All,
>
> After switching from standalone Spark to Mesos I'm experiencing some
> instability.  I'm running pyspark interactively through iPython
> notebook, and get this crash non-deterministically (although pretty
> reliably in the first 2000 tasks, often much sooner).
>
> Exception in thread "DAGScheduler" org.apache.spark.SparkException:
> EOF reached before Python server acknowledged
> at
> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:340)
> at
> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
> at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
> at
> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
> at
> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:662)
> at
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:437)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
> at
> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
>
> I'm running the following software versions on all machines:
> Spark: 0.8.1  (md5: 5d3c56eaf91c7349886d5c70439730b3)
> Mesos: 0.13.0  (md5: 220dc9c1db118bc7599d45631da578b9)
> Python 2.7.3 (Stackoverflow mentioned differing python versions may be
> to blame --- unless Spark or iPython is specifically invoking an older
> version under the hood mine are all the same).
> Ubuntu 12.0.4
>
> I've modified mesos-daemon.sh as follows:
> I had problems launching the cluster with mesos-start-cluster.sh and
> traced the problem to (what seemed to be) a bug in mesos-daemon.sh
> which used a "--conf" flag that mesos-slave and mesos-master didn't
> recognize.  I removed the flag and instead added code to read in
> environment variables from mesos-deploy-env.sh.
> mesos-start-cluster.sh then worked as advertised.
>
> Incase it's helpful, I've attached several files as follows:
> *spark_full_output: output of ipython process where SparkContext was
> created
> *mesos-deploy-env.sh: mesos config file from slave (identical to
> master except for MESOS_MASTER)
> *spark-env.sh: spark config file
> *mesos-master.INFO: log file from mesos-master
> *mesos-master.WARNING: log file from mesos-master
> *mesos-daemon.sh: my modified version of mesos-daemon.sh
>
> Incase anybody from Berkeley is so interested they want to interact
> with my deployment, my office is in Soda hall so that can definitely
> be arranged.  My apologies if anybody received a duplicate message; I
> encountered some delays and complication while joining the list.
>
> -Brad Miller
>

Mime
View raw message