spark-issues mailing list archives

From "Aaron Davidson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-1769) Executor loss can cause race condition in Pool
Date Sat, 10 May 2014 22:14:21 GMT

     [ https://issues.apache.org/jira/browse/SPARK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Davidson updated SPARK-1769:
----------------------------------

    Description: 
Loss of executors (in this case due to OOMs) exposes a race condition in Pool.scala, evident
from this stack trace:

{code}
14/05/08 22:41:48 ERROR OneForOneStrategy:
java.lang.NullPointerException
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
        at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
        at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

Note that the line of code that throws this exception is here:
{code}
schedulableQueue.foreach(_.executorLost(executorId, host))
{code}

By the stack trace, it's not schedulableQueue that is null, but an element therein. As far
as I could tell, we never add a null element to this queue. Rather, I could see (via log
messages) that removeSchedulable() and executorLost() were called at about the same time, and
I suspect that, because this ArrayBuffer is not synchronized in any way, we iterate through
the list while it's in an inconsistent state.
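
To make the failure mode concrete, here is a minimal standalone sketch (not Spark code; PoolRaceRepro and the Schedulable stub are illustrative) of the suspected race: ArrayBuffer.remove shifts elements down and nulls the vacated trailing slot, so a foreach racing with a removal can read that null and NPE when calling a method on it. Being a race, it may take several runs to trigger:

{code}
import scala.collection.mutable.ArrayBuffer

object PoolRaceRepro {
  trait Schedulable {
    def executorLost(executorId: String, host: String): Unit
  }

  def main(args: Array[String]): Unit = {
    val schedulableQueue = new ArrayBuffer[Schedulable]

    // Populate with enough elements that the iteration and the
    // removals below overlap in time.
    for (_ <- 1 to 100000) {
      schedulableQueue += new Schedulable {
        def executorLost(executorId: String, host: String): Unit = ()
      }
    }

    // Simulates removeSchedulable() running concurrently: each remove
    // shrinks the buffer and nulls the vacated trailing slot.
    val remover = new Thread(new Runnable {
      def run(): Unit = {
        while (schedulableQueue.nonEmpty) {
          schedulableQueue.remove(schedulableQueue.length - 1)
        }
      }
    })
    remover.start()

    // Simulates executorLost(): on Scala 2.10 this foreach indexes the
    // backing array directly, so it can observe a nulled slot and throw
    // a NullPointerException (newer Scala versions may instead throw a
    // ConcurrentModificationException, or succeed).
    schedulableQueue.foreach(_.executorLost("exec-1", "host-1"))
    remover.join()
  }
}
{code}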

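The fix isn't decided here, but one plausible direction, sketched below under the assumption that switching collection types is acceptable, is to back schedulableQueue with java.util.concurrent.ConcurrentLinkedQueue: its iterator is weakly consistent, so a concurrent removal hides the element rather than exposing a null slot.

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

trait Schedulable {
  def executorLost(executorId: String, host: String): Unit
}

class Pool {
  // Weakly consistent iteration: elements removed mid-traversal are
  // skipped rather than left behind as nulls.
  val schedulableQueue = new ConcurrentLinkedQueue[Schedulable]

  def addSchedulable(schedulable: Schedulable): Unit = {
    schedulableQueue.add(schedulable)
  }

  def removeSchedulable(schedulable: Schedulable): Unit = {
    schedulableQueue.remove(schedulable)
  }

  def executorLost(executorId: String, host: String): Unit = {
    schedulableQueue.asScala.foreach(_.executorLost(executorId, host))
  }
}
{code}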

> Executor loss can cause race condition in Pool
> ----------------------------------------------
>
>                 Key: SPARK-1769
>                 URL: https://issues.apache.org/jira/browse/SPARK-1769
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Aaron Davidson
>


--
This message was sent by Atlassian JIRA
(v6.2#6252)
