spark-user mailing list archives

From Chen Song <chen.song...@gmail.com>
Subject spark job hung up
Date Sat, 20 Sep 2014 09:07:58 GMT
I am testing my Spark job on YARN.

spark: 1.0.0-cdh5.1.0
yarn: cdh5.1.0

Once in a while the Spark job hangs (stuck in some stage without any
progress on the driver or executors) after some failures. Below is a list of
the typical failures on the driver and executors.

*On master/driver*
14/09/16 06:42:28 WARN TaskSetManager: Loss was due to fetch failure from null
14/09/16 06:42:28 INFO DAGScheduler: Marking Stage 0 (saveAsSequenceFile at FraudUsers.scala:144) for resubmision due to a fetch failure
14/09/16 06:42:28 INFO DAGScheduler: The failed fetch was from Stage 1 (reduceByKey at FraudUsers.scala:105); marking it for resubmission
14/09/16 06:42:28 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.NullPointerException
    at org.apache.spark.util.JsonProtocol$.blockManagerIdToJson(JsonProtocol.scala:267)
    at org.apache.spark.util.JsonProtocol$.taskEndReasonToJson(JsonProtocol.scala:249)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:103)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:52)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:84)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:102)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$7.apply(SparkListenerBus.scala:58)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$7.apply(SparkListenerBus.scala:58)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:58)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)

*On executor*
14/09/16 06:42:15 WARN SendingConnection: Error finishing connection to 369.bm-hadoopc-datanode.prod.lax1/10.0.81.19:42251
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/09/16 06:42:15 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)
14/09/16 06:42:15 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)
14/09/16 06:42:15 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@257cbf16
14/09/16 06:42:15 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(369.bm-hadoopc-datanode.prod.lax1,42251)

*On driver/master*
14/09/16 06:42:14 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(767, 369.bm-hadoopc-datanode.prod.lax1, 42251, 0)

*On executor, 369.bm-hadoopc-datanode.prod.lax1*
14/09/16 06:48:22 INFO BlockManager: BlockManager re-registering with master
14/09/16 06:48:22 INFO BlockManagerMaster: Trying to register BlockManager
14/09/16 06:48:22 INFO BlockManagerMaster: Registered BlockManager
14/09/16 06:48:22 INFO BlockManager: Reporting 63 blocks to the master.


-- 
Chen Song
