spark-user mailing list archives

From Brad Miller <bmill...@eecs.berkeley.edu>
Subject Re: pyspark crash on mesos
Date Wed, 12 Mar 2014 01:40:13 GMT
Hi All,

I've had a chance to install the head version of Spark from branch-0.9
on GitHub.  This version comes close to fixing the bug I experienced,
but still falls critically short.  To recap: when performing operations
as simple as counting records from a large (100G) RDD with about 130K
records, I now see the error shown below.  This is the same error as
before, except that it is now caught.  The net effect is that Mesos
claims to have executed all tasks, and the Spark UI shows all tasks as
complete, but the stage remains in the running state and is never moved
to completed (see attached screenshot), and the call to .count() never
returns.
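
For reference, the job that hangs is essentially just a count over a
large RDD.  A minimal sketch of the kind of call that never returns is
below; the load path (textFile on HDFS) and the master URL are just
stand-ins for my actual setup:

    from pyspark import SparkContext

    # Stand-in for my actual job: load a large (~100G, ~130K record)
    # dataset and count it.  The real data source and master URL differ.
    sc = SparkContext("mesos://<master>:5050", "count-repro")
    rdd = sc.textFile("hdfs:///path/to/large/dataset")
    print(rdd.count())  # Mesos and the UI report all tasks done, but this call never returns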

I have attached logs from the master and slaves for both Spark and Mesos.
I am consistently bitten by this bug when working with larger data, so I
would appreciate any help in developing a workaround and am glad to
supply more info as necessary.

-Brad

14/03/11 14:38:50 ERROR OneForOneStrategy: EOF reached before Python
server acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:314)
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:285)
        at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:275)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:273)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.Accumulators$.add(Accumulators.scala:273)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:826)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:601)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On Mon, Mar 3, 2014 at 4:53 PM, Josh Rosen <rosenville@gmail.com> wrote:
> Brad and I looked into this error and I have a few hunches about what might
> be happening.
>
> We didn't observe any failed tasks in the logs.  For some reason, the Python
> driver is failing to acknowledge an accumulator update from a
> successfully-completed task.  Our program doesn't explicitly use
> accumulators, but it looks like PySpark task results always contain a single
> Java accumulator with a PythonAccumulatorParam.
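>
> (For context on that: an accumulator param just defines a zero value and
> an addInPlace merge.  A user-defined one in PySpark looks roughly like the
> toy sketch below; the built-in one that PySpark attaches to task results
> performs its addInPlace by shipping the update back to the Python driver,
> and that is the step failing here.)
>
>     from pyspark.accumulators import AccumulatorParam
>
>     class ListParam(AccumulatorParam):
>         """Toy accumulator param that merges lists."""
>         def zero(self, value):
>             return []
>         def addInPlace(self, v1, v2):
>             v1.extend(v2)
>             return v1
>
>     # toy usage: acc = sc.accumulator([], ListParam()); acc += ["item"]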
>
> DAGScheduler doesn't catch the exception thrown by the failed addInPlace
> accumulator operation, which results in the entire DAGScheduler crashing and
> the job freezing.
>
> We should try to identify the root cause of the unacknowledged accumulator
> updates, but in the meantime it would be a good idea to add more exception
> handling to ensure that the DAGScheduler's main loop never crashes.  This
> might mask the presence of bugs like this accumulator issue, but it would
> prevent rare bugs from taking out the entire SparkContext (this is
> especially important for job servers that share a SparkContext across
> multiple jobs).  A general fix might be to add a "catch Exception" block in
> handleTaskCompletion that turns uncaught exceptions into task failures, and
> possibly a top-level "catch Exception" block in DAGScheduler.run() that
> causes any uncaught exceptions to immediately cancel/crash all running jobs
> (so we fail-fast instead of hanging).
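>
> As a toy Python sketch (names made up; the real code is the Scala
> DAGScheduler), the shape of that change would be something like:
>
>     class ToyScheduler(object):
>         """Illustration of the pattern only, not Spark's implementation."""
>         def __init__(self):
>             self.finished = []
>             self.failed = []
>
>         def handle_task_completion(self, task_id, accum_update):
>             try:
>                 if accum_update is None:  # stand-in for addInPlace throwing
>                     raise RuntimeError("EOF reached before Python server acknowledged")
>                 self.finished.append(task_id)
>             except Exception as e:
>                 # Turn an uncaught handler error into a task failure
>                 # instead of letting it kill the scheduler's event loop.
>                 self.failed.append((task_id, e))
>
>         def run(self, events):
>             try:
>                 for task_id, update in events:
>                     self.handle_task_completion(task_id, update)
>             except Exception:
>                 # Top-level guard: surface the error and stop immediately
>                 # (fail fast) rather than leave callers blocked forever.
>                 raise
>
>     # ToyScheduler().run([(1, 42), (2, None), (3, 7)]) -> task 2 fails, 1 and 3 finish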
>
> It would be nice if there were a way to selectively enable Java-style checked
> exceptions to avoid introducing these types of failure-handling bugs.
>
>
> On Mon, Mar 3, 2014 at 10:34 AM, Brad Miller <bmiller1@eecs.berkeley.edu>
> wrote:
>>
>> Hi All,
>>
>> After switching from standalone Spark to Mesos I'm experiencing some
>> instability.  I'm running pyspark interactively through an IPython
>> notebook, and get this crash non-deterministically (although pretty
>> reliably in the first 2000 tasks, often much sooner).
>>
>> Exception in thread "DAGScheduler" org.apache.spark.SparkException:
>> EOF reached before Python server acknowledged
>> at
>> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:340)
>> at
>> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
>> at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
>> at
>> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
>> at
>> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>> at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
>> at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:662)
>> at
>> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:437)
>> at
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
>>
>> I'm running the following software versions on all machines:
>> Spark: 0.8.1  (md5: 5d3c56eaf91c7349886d5c70439730b3)
>> Mesos: 0.13.0  (md5: 220dc9c1db118bc7599d45631da578b9)
>> Python: 2.7.3 (Stack Overflow mentioned that differing Python versions may
>> be to blame; unless Spark or IPython is specifically invoking an older
>> version under the hood, mine are all the same)
>> Ubuntu: 12.04
>>
>> I've modified mesos-daemon.sh as follows:
>> I had problems launching the cluster with mesos-start-cluster.sh and
>> traced the problem to (what seemed to be) a bug in mesos-daemon.sh
>> which used a "--conf" flag that mesos-slave and mesos-master didn't
>> recognize.  I removed the flag and instead added code to read in
>> environment variables from mesos-deploy-env.sh.
>> mesos-start-cluster.sh then worked as advertised.
>>
>> In case it's helpful, I've attached several files as follows:
>> *spark_full_output: output of the IPython process where the SparkContext
>> was created
>> *mesos-deploy-env.sh: Mesos config file from a slave (identical to the
>> master's except for MESOS_MASTER)
>> *spark-env.sh: Spark config file
>> *mesos-master.INFO: log file from mesos-master
>> *mesos-master.WARNING: log file from mesos-master
>> *mesos-daemon.sh: my modified version of mesos-daemon.sh
>>
>> In case anybody from Berkeley is interested enough to want to interact
>> with my deployment, my office is in Soda Hall, so that can definitely
>> be arranged.  My apologies if anybody received a duplicate message; I
>> encountered some delays and complications while joining the list.
>>
>> -Brad Miller
>
>
