spark-user mailing list archives

From lihu <lihu...@gmail.com>
Subject Re: increase the akka.frameSize lead to Lost Executor
Date Tue, 20 May 2014 05:51:35 GMT
I later found that this problem occurs in the reduceByKey() operation:

1. If I set akka.frameSize <= 10MB, only part of the tasks get
serialized in the reduceByKey stage. The driver sends the LaunchTask
command to the executor, but the executor never receives it.

2. If I set akka.frameSize > 10MB, all the tasks are serialized
successfully, but then "Lost executor: remote Akka
client disassociated" occurs and the stage is aborted.
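
For reference, this is roughly how I set the frame size (a sketch assuming
Spark 0.9's SparkConf API; spark.akka.frameSize is given in MB and has to
match on the driver and the executors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: raise the Akka frame size above the 10MB default.
// The value is in MB; driver and executors must agree on it.
val conf = new SparkConf()
  .setAppName("kmeans-frame-size")
  .set("spark.akka.frameSize", "50")
val sc = new SparkContext(conf)
```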


Since I have set akka.frameSize to 4MB, and the network can send 4MB of data
successfully, I wonder whether this is a bug in Spark?
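
To estimate how big a task actually gets when serialized, I used a rough
check with plain JDK serialization (Spark 0.9 serializes task closures with
Java serialization by default; the 30 x 100,000 centers below are a made-up
stand-in, not my real data):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSizeCheck {
  // Rough size estimate via plain Java serialization, which Spark 0.9
  // uses for task closures by default.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in: 30 centers x 100,000 dims x 8 bytes
    // is already ~24MB, well past a 10MB frame size.
    val centers = Array.fill(30)(Array.fill(100000)(1.0))
    println("serialized centers: ~" + serializedSize(centers) / (1024 * 1024) + " MB")
  }
}
```

If kmeans ships the current centers inside every task's closure, that alone
could explain a serialized task of more than 10MB.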





On Fri, May 16, 2014 at 11:12 AM, lihu <lihu723@gmail.com> wrote:

> Hi,
>     I just run the kmeans algorithm of MLlib, the size of data is about
> 800M. When I run into some step:* reduceByKey* operation, I found the
> size of Serialized task is more than 10MB, so I change the akka.frameSize
> properties  to 50MB, but after I changed this, it lead to following error:
>
>
> 14/05/16 02:44:12 INFO SparkDeploySchedulerBackend: Executor 23
> disconnected, so removing it
> 14/05/16 02:44:12 ERROR TaskSchedulerImpl: Lost executor 23 on Husky043:
> remote Akka client disassociated
> 14/05/16 02:44:12 INFO TaskSetManager: Re-queueing tasks for 23 from
> TaskSet 12.0
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5628 (task 12.0:26)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5541 (task 12.0:23)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5744 (task 12.0:30)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5657 (task 12.0:27)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5570 (task 12.0:24)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5773 (task 12.0:31)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5686 (task 12.0:28)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5599 (task 12.0:25)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5802 (task 12.0:32)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5512 (task 12.0:22)
> 14/05/16 02:44:12 WARN TaskSetManager: Lost TID 5715 (task 12.0:29)
> 14/05/16 02:44:12 INFO DAGScheduler: Executor lost: 23 (epoch 3)
> 14/05/16 02:44:12 INFO BlockManagerMasterActor: Trying to remove executor
> 23 from BlockManagerMaster.
> 14/05/16 02:44:12 INFO BlockManagerMaster: Removed 23 successfully in
> removeExecutor
> 14/05/16 02:44:13 INFO AppClient$ClientActor: Executor updated:
> app-20140516023701-0002/23 is now FAILED (Command exited with code 1)
> 14/05/16 02:44:13 INFO SparkDeploySchedulerBackend: Executor
> app-20140516023701-0002/23 removed: Command exited with code 1
> 14/05/16 02:44:13 INFO AppClient$ClientActor: Executor added:
> app-20140516023701-0002/29 on worker-20140516021952-Husky043-52882
> (Husky043:52882) with 11 cores
>
>  and the web UI looks like this:
>
> [web UI screenshot not preserved in the archive]
>
>    the code is very simple:
>
>    val data = MLUtils.loadLabeledData(sc, input_path)
>    val parsedData = data.map(_.features).cache()
>    val numIterations = 10
>    val numClusters = 30
>    val clusters = KMeans.train(parsedData, numClusters, numIterations)
>
>
>    my Spark version is 0.9.
>
> =====================================================================
> Here is my analysis:
>
> ① I found the following log on one of my worker nodes. Comparing the
> timestamps between the master and the worker, the error occurs right after
> the executor sends the task result of stage 12 back to the driver.
>
> 2014-05-16 02:43:59,124 [Executor task launch worker-0] DEBUG
> org.apache.spark.storage.BlockManager - Getting block rdd_3_386 from memory
> 2014-05-16 02:43:59,124 [Executor task launch worker-0] INFO
>  org.apache.spark.storage.BlockManager - Found block rdd_3_386 locally
> 2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
>  org.apache.spark.executor.Executor - Serialized size of result for 5490 is
> 520
> 2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
>  org.apache.spark.executor.Executor - Sending result for 5490 directly to
> driver
> 2014-05-16 02:43:59,168 [Executor task launch worker-0] INFO
>  org.apache.spark.executor.Executor - Finished task ID 5490
> 2014-05-16 02:44:14,044 [delete Spark local dirs] DEBUG
> org.apache.spark.storage.DiskBlockManager - Shutdown hook called
> 2014-05-16 02:44:14,044 [Thread-17] DEBUG org.apache.hadoop.ipc.Client -
> Stopping client
>
> ② Then I wondered whether the broadcast performance was the problem, so I
> changed the default HttpBroadcast to TorrentBroadcast, but it also failed.
>
>
> I want to log the Akka events, but I am not familiar with Akka. Any
> advice on how to log Akka would be appreciated!
>
>
>
>
>
>
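
P.S. For the Akka logging question above, one thing I plan to try from the
configuration docs (a sketch; I am assuming the 0.9 property name and have
not verified it yet):

```scala
// Sketch, assuming the Spark 0.9 property name (not verified):
// log Akka remote lifecycle events such as association/disassociation.
val conf = new SparkConf()
  .set("spark.akka.logLifecycleEvents", "true")
// Raising Akka's level in conf/log4j.properties should also help:
//   log4j.logger.akka=DEBUG
```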
