Hi Saisai:
    Thank you for your reply. How can I get the logs of the executors and the node manager, besides the YARN log that I have already fetched with the command "yarn logs -applicationId"?
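I guess something like the commands below might pull the per-container logs and the node manager's own log, but I am not sure whether this is the right way. The container id, the node manager address, and the log path are only guesses based on my cluster:

# aggregated log of a single container of the application (container id and node address are hypothetical)
yarn logs -applicationId application_1439457182724_0026 -containerId container_1439457182724_0026_01_000005 -nodeAddress A01-R08-2-I160-115.JD.LOCAL:45454

# node manager daemon log, read directly on the node (the path depends on how Hadoop is installed)
ssh A01-R08-2-I160-115.JD.LOCAL "tail -n 500 /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-A01-R08-2-I160-115.JD.LOCAL.log"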

------------------ Original Message ------------------
From: "Saisai Shao" <sai.sai.shao@gmail.com>
Sent: Monday, August 17, 2015 at 12:47 PM
To: "刚" <949294472@qq.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: How should I solve this problem that the executors of my Spark application are always blocked after an executor is lost?

Hi,

I think you need to look for clues in the executor logs, as well as the node manager logs, to dig out more details and exception stack traces, so that we can get a clearer picture of what caused this problem.

Thanks
Saisai

On Mon, Aug 17, 2015 at 11:35 AM, 刚 <949294472@qq.com> wrote:
Hi guys:
    I run 9 applications in my Spark cluster at the same time. They all run well in the beginning, but after several hours some applications lose one executor and the other executors become blocked. By the way, I am using Spark Streaming to analyze real-time messages. The screenshots are as follows.
Figure 1: The stage has lasted a long time after one executor was lost.

Figure 2: Task info for the stage that has lasted a long time after one executor was lost.

The command I use to submit an application is as follows:
spark-submit --class spark_security.login_users.Sockpuppet  --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4  --name pcLoginSparkDealerUser --master yarn  --deploy-mode cluster  spark_Security-1.0-SNAPSHOT.jar hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties 
     
The other 8 applications are submitted with the same driver-memory, executor-memory, num-executors, and executor-cores settings, and they all run in cluster mode.

When the problem happens, I get the YARN logs using the following command:

yarn logs -applicationId application_1439457182724_0026

I cannot find any exception stack trace, but I do find the following information:
15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472653000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time 1439472653000 ms saved to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000', took 473939 bytes and 65 ms
15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5 on A01-R08-2-I160-115.JD.LOCAL: remote Akka client disassociated
15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for 3 from TaskSet 3719.0
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime is 1439457657000 ms and slideDuration is 15000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime is 1439457657000 ms and slideDuration is 45000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime is 1439457657000 ms and slideDuration is 60000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime is 1439457657000 ms and slideDuration is 120000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time 1439472654000 ms
15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for time 1439472654000 ms
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data for time 1439472654000 ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 3719.0 (TID 707634, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data for time 1439472654000 ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 3719.0 (TID 707625, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3719.0 (TID 707628, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in stage 3719.0 (TID 707631, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 930)
15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to remove executor 3 from BlockManagerMaster.
15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable on executor 3 (111/180, false)
15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472654000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'

Someone says that it is caused by an OOM, but I cannot find any OOM stack trace.
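I have not tried it yet, but I suppose I could confirm whether an OOM really happens by making the executor JVM dump its heap when it runs out of memory. I assume adding a --conf flag like the one below to my submit command would do that (the dump path /tmp is only an example; everything else is the same command as above):

spark-submit --class spark_security.login_users.Sockpuppet \
  --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 \
  --name pcLoginSparkDealerUser --master yarn --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp" \
  spark_Security-1.0-SNAPSHOT.jar hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties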

I set spark-defaults.conf as follows:
spark.core.connection.ack.wait.timeout  3600
spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                    1024
spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                      900
spark.storage.memoryFraction            0.4
spark.rdd.compress      
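
I am also wondering whether I should give each executor more off-heap room in its YARN container, for example by adding a line like the one below to spark-defaults.conf. The value is only a guess that I have not tested:

# hypothetical value in MB, not tested
spark.yarn.executor.memoryOverhead      1024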

I would really appreciate it if anyone could tell me how to solve this problem. It has bothered me for a long time.