spark-user mailing list archives

From Ashish Rawat <Ashish.Ra...@guavus.com>
Subject Spark Application Hung
Date Tue, 24 Mar 2015 18:41:08 GMT
Hi,

We are observing a hung Spark application when one of the YARN datanodes (running multiple
Spark executors) goes down.

Setup details:

  *   Spark: 1.2.1
  *   Hadoop: 2.4.0
  *   Spark Application Mode: yarn-client
  *   2 datanodes (DN1, DN2)
  *   6 Spark executors: initially 3 each on DN1 and DN2; after DN2 is rebooted, this changes
to 4 executors on DN1 and 2 on DN2 (a rough configuration sketch follows this list)
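
For reference, this is roughly how the application is configured (a sketch only; the app name
is a placeholder and the property names are the ones we understand to be standard for Spark 1.2
on YARN, with shuffle-related settings left at their defaults):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of our launch configuration: yarn-client mode, 6 executors
// requested via spark.executor.instances (a YARN-only property).
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("our-app")                   // placeholder name
  .set("spark.executor.instances", "6")    // 3 executors expected per datanode
val sc = new SparkContext(conf)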

Scenario:

When one of the datanodes (DN2) is brought down, the application hangs, with the Spark driver
continuously showing the following warning:

15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID 37941, DN1): FetchFailed(null,
shuffleId=155, mapId=-1, reduceId=5, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
155
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
        at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
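
My reading of this code path (a simplified sketch, not the actual Spark 1.2 source; the names
below are made up for illustration) is that the exception is thrown when the MapStatus array the
driver hands back for shuffle 155 still contains null entries, i.e. map outputs whose executor
has died:

// Simplified illustration of the convertMapStatuses step in the trace above:
// any null entry (a map task whose output was lost, e.g. with the executors
// on DN2) becomes a MetadataFetchFailedException instead of a block fetch.
case class MapStatus(blockManagerHost: String, sizeForReducer: Long)

class MetadataFetchFailedException(msg: String) extends Exception(msg)

def convertMapStatuses(
    shuffleId: Int,
    reduceId: Int,
    statuses: Array[MapStatus]): Array[(String, Long)] =
  statuses.map {
    case null =>
      // This is the "Missing an output location for shuffle 155" error we keep seeing.
      throw new MetadataFetchFailedException(
        s"Missing an output location for shuffle $shuffleId")
    case status =>
      (status.blockManagerHost, status.sizeForReducer)
  }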


When DN2 is brought down, one executor gets launched on DN1. When DN2 is brought back up after
15 minutes, 2 executors get launched on it.
All the executors (including the ones launched after DN2 comes back) keep showing
the following errors:

15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155,
fetching them
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155,
fetching them
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@NN1:44353/user/MapOutputTracker#-957394722]
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Got the output locations
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155
15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44623
15/03/24 12:43:30 INFO executor.Executor: Running task 5.0 in stage 232.960 (TID 44623)
15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 44629
15/03/24 12:43:30 INFO executor.Executor: Running task 11.0 in stage 232.960 (TID 44629)
15/03/24 12:43:30 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2275
15/03/24 12:43:30 INFO storage.MemoryStore: ensureFreeSpace(16308) called with curMem=44996,
maxMem=5556708311
15/03/24 12:43:30 INFO storage.MemoryStore: Block broadcast_2275_piece0 stored as bytes in
memory (estimated size 15.9 KB, free 5.2 GB)
15/03/24 12:43:30 INFO storage.BlockManagerMaster: Updated info of block broadcast_2275_piece0
15/03/24 12:43:30 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2275 took 97
ms
15/03/24 12:43:30 INFO storage.MemoryStore: ensureFreeSpace(28688) called with curMem=61304,
maxMem=5556708311
15/03/24 12:43:30 INFO storage.MemoryStore: Block broadcast_2275 stored as values in memory
(estimated size 28.0 KB, free 5.2 GB)
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for shuffle 155

The newly launched executors show almost zero memory utilisation and are stuck repeating the
above errors.

The driver has the following logs just before the "Lost task" messages begin to appear:

15/03/24 12:39:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for
shuffle 155 to sparkExecutor@DN1:35682
15/03/24 12:39:26 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 155 is
527 bytes
15/03/24 12:39:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for
shuffle 155 to sparkExecutor@DN1:34062
15/03/24 12:39:26 INFO MapOutputTrackerMasterActor: Asked to send map output locations for
shuffle 155 to sparkExecutor@DN1:45639

Before the application hangs on shuffle id 155, similar warnings appear for other shuffle
ids and get resolved within 2-3 attempts, but the warning for shuffle 155 is never resolved.
Is this because the mapStatuses map in the MapOutputTrackerMaster on the driver holds a corrupt
state for shuffle 155? If so, how can the Spark application reach such a state, and aren't there
suitable steps to recover?
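
To make the suspicion concrete, this is the kind of driver-side state I am guessing at (purely
illustrative; the field and type names are simplified, not the actual Spark internals):

import scala.collection.mutable

// Hypothetical picture of the driver-side bookkeeping: one MapStatus entry
// per map task of a shuffle, where entries produced by executors that died
// with DN2 remain null until the map stage is successfully re-run.
case class MapStatus(blockManagerHost: String, sizeForReducer: Long)

val mapStatuses = mutable.Map[Int, Array[MapStatus]]()

// Suppose some map tasks of shuffle 155 ran on DN2; their entries are now
// null, so every reducer asking for shuffle 155 fails until those outputs
// are regenerated and re-registered.
mapStatuses(155) = Array(
  MapStatus("DN1", 1024L),
  null,                       // map output that lived on DN2
  MapStatus("DN1", 2048L)
)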

Can someone please help in debugging this issue? We haven't restarted the system yet, as this
error may not be easily reproducible, but resolving such issues is critical for taking our
application to production. Would appreciate some quick help!

Regards,
Ashish


