spark-user mailing list archives

From lihu <lihu...@gmail.com>
Subject increase the akka.frameSize lead to Lost Executor
Date Fri, 16 May 2014 03:12:17 GMT
Hi,
    I just ran the KMeans algorithm from MLlib on about 800 MB of data.
At the reduceByKey step, I found that the serialized task was larger than
10 MB, so I changed the akka.frameSize property to 50 MB. After that
change, I got the following error:


14/05/16 02:44:12 INFO SparkDeploySchedulerBackend: Executor 23
disconnected, so removing it
14/05/16 02:44:12 ERROR TaskSchedulerImpl: Lost executor 23 on Husky043:
remote Akka client disassociated
14/05/16 02:44:12 INFO TaskSetManager: Re-queueing tasks for 23 from
TaskSet 12.0
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5628 (task 12.0:26)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5541 (task 12.0:23)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5744 (task 12.0:30)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5657 (task 12.0:27)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5570 (task 12.0:24)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5773 (task 12.0:31)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5686 (task 12.0:28)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5599 (task 12.0:25)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5802 (task 12.0:32)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5512 (task 12.0:22)
14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5715 (task 12.0:29)
14/05/16 02:44:12 INFO DAGScheduler: Executor lost: 23 (epoch 3)
14/05/16 02:44:12 INFO BlockManagerMasterActor: Trying to remove executor
23 from BlockManagerMaster.
14/05/16 02:44:12 INFO BlockManagerMaster: Removed 23 successfully in
removeExecutor
14/05/16 02:44:13 INFO AppClient$ClientActor: Executor updated:
app-20140516023701-0002/23 is now FAILED (Command exited with code 1)
14/05/16 02:44:13 INFO SparkDeploySchedulerBackend: Executor
app-20140516023701-0002/23 removed: Command exited with code 1
14/05/16 02:44:13 INFO AppClient$ClientActor: Executor added:
app-20140516023701-0002/29 on worker-20140516021952-Husky043-52882
(Husky043:52882) with 11 cores

 The web UI looked like this (screenshot not preserved):

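   The code is very simple:

   import org.apache.spark.mllib.clustering.KMeans
   import org.apache.spark.mllib.util.MLUtils

   // sc is the SparkContext; input_path points to the labeled input data
   val data = MLUtils.loadLabeledData(sc, input_path)
   // keep only the feature vectors and cache them across iterations
   val parsedData = data.map(_.features).cache()
   val numIterations = 10
   val numClusters = 30
   val clusters = KMeans.train(parsedData, numClusters, numIterations)


   My Spark version is 0.9.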
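
For reference, a minimal sketch of how the frame size change could be applied on Spark 0.9, assuming the setting is passed as a JVM system property (which SparkConf loads by default). The Spark property name is spark.akka.frameSize and its value is in MB, with 10 as the default. The object name FrameSizeConfig is hypothetical and only used for illustration.

```scala
object FrameSizeConfig {
  // Value for spark.akka.frameSize, in MB (the default in Spark 0.9 is 10).
  val FrameSizeMb = 50

  // Sets the property as a JVM system property. Assumption: this is called
  // before the SparkContext is constructed, so that the SparkConf created
  // afterwards picks the value up.
  def apply(): Unit =
    System.setProperty("spark.akka.frameSize", FrameSizeMb.toString)
}
```

Calling FrameSizeConfig.apply() before `new SparkContext(...)` would be enough on the driver side; whether the executors see the same value depends on how they are launched.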

=====================================================================
Here is my analysis:

① I found the following log on one of my worker nodes. Comparing the
timestamps on the master and the worker, I found that the error occurs
after the executor sends its task results for stage 12 to the master.

2014-05-16 02:43:59,124 [Executor task launch worker-0] DEBUG
org.apache.spark.storage.BlockManager - Getting block rdd_3_386 from memory
2014-05-16 02:43:59,124 [Executor task launch worker-0] INFO
 org.apache.spark.storage.BlockManager - Found block rdd_3_386 locally
2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
 org.apache.spark.executor.Executor - Serialized size of result for 5490 is
520
2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
 org.apache.spark.executor.Executor - Sending result for 5490 directly to
driver
2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
 org.apache.spark.executor.Executor - Finished task ID 5490
2014-05-16 02:44:14,044 [delete Spark local dirs] DEBUG
org.apache.spark.storage.DiskBlockManager - Shutdown hook called
2014-05-16 02:44:14,044 [Thread-17] DEBUG org.apache.hadoop.ipc.Client -
Stopping client
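
The timestamp comparison described above can be made concrete with a small sketch. It parses the two timestamp formats that appear in the logs and computes the gap between a worker-side event and a driver-side event. It assumes Java 8 or later (for java.time) and that the clocks on both machines are synchronized; the object and method names are hypothetical.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object LogGap {
  // Driver log lines start with timestamps like "14/05/16 02:44:12".
  private val driverFmt = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss")
  // Worker log lines start with timestamps like "2014-05-16 02:43:59,168".
  private val workerFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS")

  def driverTime(s: String): LocalDateTime = LocalDateTime.parse(s, driverFmt)
  def workerTime(s: String): LocalDateTime = LocalDateTime.parse(s, workerFmt)

  // Milliseconds from a worker-side event to a later driver-side event.
  def gapMillis(worker: String, driver: String): Long =
    Duration.between(workerTime(worker), driverTime(driver)).toMillis
}
```

With the values from the logs, LogGap.gapMillis("2014-05-16 02:43:59,168", "14/05/16 02:44:12") gives 12832, so the worker log is silent for roughly 13 seconds between the last finished task and the moment the driver reports the executor as lost.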

② Then I wondered whether the broadcast was performing poorly, so I
changed the default HttpBroadcast to TorrentBroadcast, but it still failed.
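
The broadcast switch mentioned in ② is controlled by the spark.broadcast.factory property. A minimal sketch, again assuming the value is supplied as a JVM system property before the SparkContext is created; the object name BroadcastConfig is hypothetical.

```scala
object BroadcastConfig {
  // Fully qualified class name of Spark's BitTorrent-like broadcast factory.
  val TorrentFactory = "org.apache.spark.broadcast.TorrentBroadcastFactory"

  // Replaces the default HTTP-based factory. Assumption: called before the
  // SparkContext is constructed.
  def useTorrentBroadcast(): Unit =
    System.setProperty("spark.broadcast.factory", TorrentFactory)
}
```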


I would like to log the Akka events, but I am not familiar with Akka. Any
advice, or pointers on how to enable Akka logging, would be appreciated!
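
One possible starting point, offered as an unverified sketch: Spark's Akka setup in the 0.9 line appears to read a spark.akka.logLifecycleEvents flag that turns on Akka's remote lifecycle event logging. Both the flag's presence in a given release and the object name AkkaDebugConfig below are assumptions and should be checked against the Spark source of the version in use.

```scala
object AkkaDebugConfig {
  // Assumption: this property is read when Spark builds its Akka
  // configuration; verify against AkkaUtils in the Spark version in use.
  val LifecycleFlag = "spark.akka.logLifecycleEvents"

  // Must run before the SparkContext (and its actor system) is created.
  def enable(): Unit =
    System.setProperty(LifecycleFlag, "true")
}
```

If the flag is honored, connection and disassociation events should show up in the driver and executor logs alongside the messages quoted above.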
