spark-dev mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject [0.9.0] Possible deadlock in shutdown hook?
Date Thu, 06 Feb 2014 10:20:56 GMT
Hi Spark devs,

Occasionally, when I hit Ctrl-C in the Scala Spark shell on 0.9.0, one of
my workers goes dead in the Spark master UI.  I'm using the standalone
cluster and never saw this with 0.8.0, so I think it may be a regression.

When I probe the hung CoarseGrainedExecutorBackend JVM with jstack and
jmap -heap, it doesn't respond unless I add the -F force flag.  The heap
isn't full, but there are some interesting bits in the jstack output.
Poking around a little, I think there may be some kind of deadlock in the
shutdown hooks.
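For context on why a second exit call would block: java.lang.Shutdown.exit
synchronizes on a JVM-internal lock, so any System.exit() issued while the
shutdown sequence is already running waits until that sequence (including all
hooks) completes.  A minimal model of that contention, using an explicit lock
of my own rather than the JDK's internal one:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.locks.ReentrantLock

// Stand-in for the JDK-internal lock that Shutdown.exit synchronizes on.
val shutdownLock = new ReentrantLock()
val hookRunning = new CountDownLatch(1)
val secondExitBlocked = new AtomicBoolean(false)

// Like thread 3865: the signal handler owns the lock while hooks run.
val signalHandler = new Thread(() => {
  shutdownLock.lock()
  try {
    hookRunning.countDown()
    Thread.sleep(500) // simulate a long-running shutdown hook
  } finally shutdownLock.unlock()
})

// Like thread 14308: the actor calls System.exit() and blocks on that lock.
val actorExit = new Thread(() => {
  hookRunning.await()
  // tryLock with a short timeout instead of lock(), so this model
  // terminates; the real System.exit() would simply block here.
  secondExitBlocked.set(!shutdownLock.tryLock(100, TimeUnit.MILLISECONDS))
})

signalHandler.start(); actorExit.start()
signalHandler.join(); actorExit.join()
println(s"second exit blocked: ${secondExitBlocked.get}")
```

If that picture is right, it isn't a true deadlock, just the second exiter
parked until the first shutdown sequence finishes.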

Below are the threads I think are most interesting:

Thread 14308: (state = BLOCKED)
 - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
 - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted frame)
 - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
 - org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object, scala.Function1) @bci=352, line=81 (Interpreted frame)
 - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25, line=498 (Interpreted frame)
 - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39, line=456 (Interpreted frame)
 - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24, line=237 (Interpreted frame)
 - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted frame)
 - akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec() @bci=4, line=386 (Interpreted frame)
 - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10, line=260 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask) @bci=10, line=1339 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=11, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)

Thread 3865: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Thread.join(long) @bci=38, line=1280 (Interpreted frame)
 - java.lang.Thread.join() @bci=2, line=1354 (Interpreted frame)
 - java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=106 (Interpreted frame)
 - java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=46 (Interpreted frame)
 - java.lang.Shutdown.runHooks() @bci=39, line=123 (Interpreted frame)
 - java.lang.Shutdown.sequence() @bci=26, line=167 (Interpreted frame)
 - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
 - java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=52 (Interpreted frame)
 - sun.misc.Signal$1.run() @bci=8, line=212 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


Thread 3987: (state = BLOCKED)
 - java.io.UnixFileSystem.list(java.io.File) @bci=0 (Interpreted frame)
 - java.io.File.list() @bci=29, line=1116 (Interpreted frame)
 - java.io.File.listFiles() @bci=1, line=1201 (Compiled frame)
 - org.apache.spark.util.Utils$.listFilesSafely(java.io.File) @bci=1, line=466 (Interpreted frame)
 - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=9, line=478 (Compiled frame)
 - org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.io.File) @bci=4, line=479 (Compiled frame)
 - org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.lang.Object) @bci=5, line=478 (Compiled frame)
 - scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized, scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.WrappedArray.foreach(scala.Function1) @bci=2, line=34 (Compiled frame)
 - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=19, line=478 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.io.File) @bci=14, line=141 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.lang.Object) @bci=5, line=139 (Interpreted frame)
 - scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized, scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, line=108 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1.run() @bci=39, line=139 (Interpreted frame)


I think what happened here is that thread 14308 received the akka
"shutdown" message and called System.exit().  Meanwhile thread 3865 (the
Ctrl-C signal handler, judging by the sun.misc.Signal frames) was already
running the JVM shutdown sequence, so thread 14308's System.exit() is
blocked behind it.  Part of that sequence is running the shutdown hooks,
which started thread 3987.  That thread is the shutdown hook from
addShutdownHook() in DiskBlockManager.scala, which looks like this:

  private def addShutdownHook() {
    localDirs.foreach(localDir => Utils.registerShutdownDeleteDir(localDir))
    Runtime.getRuntime.addShutdownHook(new Thread("delete Spark local dirs") {
      override def run() {
        logDebug("Shutdown hook called")
        localDirs.foreach { localDir =>
          try {
            if (!Utils.hasRootAsShutdownDeleteDir(localDir)) {
              Utils.deleteRecursively(localDir)
            }
          } catch {
            case t: Throwable =>
              logError("Exception while deleting local spark dir: " + localDir, t)
          }
        }

        if (shuffleSender != null) {
          shuffleSender.stop()
        }
      }
    })
  }
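If hooks really are racing to delete the same trees, one defensive tweak
(a sketch of mine, not Spark's actual code) would be to treat a null
listFiles() result and already-vanished files as success rather than an
error, since another hook may have removed them first:

```scala
import java.io.File

// Hypothetical variant of Utils.deleteRecursively that tolerates another
// thread deleting entries out from under it.
def deleteRecursivelySafe(dir: File): Unit = {
  // listFiles() returns null if the directory vanished or is unreadable.
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  children.foreach(deleteRecursivelySafe)
  // delete() returning false is fine if the file is already gone.
  if (!dir.delete() && dir.exists()) {
    throw new java.io.IOException("Failed to delete: " + dir)
  }
}

// Usage: build a small tree in a temp dir, then delete it.
val root = java.nio.file.Files.createTempDirectory("spark-local-").toFile
new File(root, "sub").mkdirs()
new File(root, "sub/block-0").createNewFile()
deleteRecursivelySafe(root)
println(s"deleted: ${!root.exists()}")
```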

It goes through and deletes the directories recursively.  I was thinking
there might be an issue with concurrently-running shutdown hooks deleting
things out from underneath each other (the shutdown-hook javadocs say all
registered hooks are started in parallel), causing the File.list() in that
last thread to take quite some time.
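The Thread.join() at the top of thread 3865's stack comes from exactly that
pattern: ApplicationShutdownHooks.runHooks starts every registered hook and
then joins each one, so a single slow hook delays the whole JVM exit.  A
rough paraphrase of that loop (mine, not the actual JDK source):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Two stand-in hooks that record when they have run.
val ran = new ConcurrentLinkedQueue[String]()
val hooks = Seq(
  new Thread(() => ran.add("delete Spark local dirs")),
  new Thread(() => ran.add("some other hook"))
)

// ApplicationShutdownHooks.runHooks in spirit: start every hook first,
// then join each one -- all hooks run concurrently, and the JVM cannot
// finish exiting until the slowest one completes.
hooks.foreach(_.start())
hooks.foreach(_.join())
println(s"hooks finished: ${ran.size}")
```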

While I was looking through the stacktrace the JVM finally exited (after
at least 15-20 minutes), so I won't be able to debug further until this
bug strikes again.

Any ideas on what might be going on here?

Thanks!
Andrew
