spark-user mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: What should I do to solve the problem that the executors of my Spark application are always blocked after an executor is lost?
Date Mon, 17 Aug 2015 05:47:07 GMT
There is a link (stderr) on the executor page; you can click it to see the logs.

Also, for the failed executors, as far as I know you can log in to the node where
the executor ran and go to the Yarn log dir; there is a "userlogs" folder in which
you can find the logs of the executors you want.

The Yarn node manager log can also be found in the Yarn log dir; for HDP it is
/var/log/hadoop-yarn/yarn by default.
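
For example, here is a minimal sketch of pulling those logs on the node itself,
assuming the default HDP layout above and the application id from your mail; the
exact directory and file names may differ on your setup:

    # Executor (container) logs live under the node manager's local log dir.
    cd /var/log/hadoop-yarn/yarn/userlogs/application_1439457182724_0026
    ls                        # one sub-directory per container on this node
    less container_*/stderr   # executor stderr, where exception stacks usually appear

    # The node manager's own log sits one level up in the same Yarn log dir.
    less /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log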

Thanks
Saisai

On Mon, Aug 17, 2015 at 1:27 PM, 刚 <949294472@qq.com> wrote:

>
>  Hi Saisai:
>     Thank you for your reply. How can I get the logs of the executors and the node
> manager, other than the yarn log that I already got with the command "yarn logs
> --applicationId"?
>
> ------------------ Original Message ------------------
> *From:* "Saisai Shao" <sai.sai.shao@gmail.com>
> *Sent:* Monday, August 17, 2015, 12:47 PM
> *To:* "刚" <949294472@qq.com>
> *Cc:* "user" <user@spark.apache.org>
> *Subject:* Re: What should I do to solve the problem that the executors of
> my Spark application are always blocked after an executor is lost?
>
> Hi,
>
> I think you need to find some clues in the logs of the executors, as well as the
> node manager, to dig out more details and exception stacks, so that we can get a
> clear picture of what caused such problems.
>
> Thanks
> Saisai
>
> On Mon, Aug 17, 2015 at 11:35 AM, 刚 <949294472@qq.com> wrote:
>
>> Hi guys:
>>     I run 9 applications in my Spark cluster at the same time. They all
>> run well in the beginning, but after several hours some applications lose
>> one executor and the other executors become blocked. By the way, I am using
>> Spark Streaming to analyze real-time messages. The screenshots are as
>> follows.
>> Figure 1: The stage has lasted for a long time after one executor is lost.
>>
>> Figure 2: The task info of the stage that has lasted for a long time after
>> one executor is lost.
>>
>> The command that I use to submit an application is as follows:
>>
>> spark-submit --class spark_security.login_users.Sockpuppet \
>>   --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 \
>>   --name pcLoginSparkDealerUser --master yarn --deploy-mode cluster \
>>   spark_Security-1.0-SNAPSHOT.jar \
>>   hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties
>>
>> The other 8 applications are submitted with the same driver-memory,
>> executor-memory, num-executors, and executor-cores, and they all run in
>> cluster mode.
>>
>> When the problem happens, I get the yarn logs using the following command:
>>
>> yarn logs -applicationId application_1439457182724_0026
>>
>> I cannot find any exception stack, but I do find the following information:
>> 15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint for
>> time 1439472653000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
>> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
>> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting
>> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
>> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time
>> 1439472653000 ms saved to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
>> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000', took
>> 473939 bytes and 65 ms
>> 15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from
>> remote. Handshake timed out or transport failure detector triggered.
>> 15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5
>> on A01-R08-2-I160-115.JD.LOCAL: remote Akka client disassociated
>> 15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association
>> with remote system [akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922]
>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>> 15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for 3
>> from TaskSet 3719.0
>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
>> invalid as zeroTime is 1439457657000 ms and slideDuration is 15000 ms and
>> difference is 14997000 ms
>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
>> invalid as zeroTime is 1439457657000 ms and slideDuration is 45000 ms and
>> difference is 14997000 ms
>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
>> invalid as zeroTime is 1439457657000 ms and slideDuration is 60000 ms and
>> difference is 14997000 ms
>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
>> invalid as zeroTime is 1439457657000 ms and slideDuration is 120000 ms and
>> difference is 14997000 ms
>> 15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time
>> 1439472654000 ms
>> 15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for
>> time 1439472654000 ms
>> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data
>> for time 1439472654000 ms
>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in stage
>> 3719.0 (TID 707634, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>> (executor 5 lost)
>> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data
>> for time 1439472654000 ms
>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
>> 3719.0 (TID 707625, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>> (executor 5 lost)
>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in stage
>> 3719.0 (TID 707628, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>> (executor 5 lost)
>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in stage
>> 3719.0 (TID 707631, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>> (executor 5 lost)
>> 15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch
>> 930)
>> 15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to remove
>> executor 3 from BlockManagerMaster.
>> 15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3
>> successfully in removeExecutor
>> 15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable on
>> executor 3 (111/180, false)
>> 15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint for
>> time 1439472654000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:90
>> 00/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'
>>
>> Someone said that it is caused by OOM, but I cannot find any OOM stack
>> trace.
>>
>> I set spark-defaults.conf as follows:
>> spark.core.connection.ack.wait.timeout  3600
>> spark.core.connection.auth.wait.timeout 3600
>> spark.akka.frameSize                    1024
>> spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0–2041
>> spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0–2041
>> spark.akka.timeout                      900
>> spark.storage.memoryFraction            0.4
>> spark.rdd.compress
>>
>> It would be very much appreciated if anyone could tell me how to solve this
>> problem. It has bothered me for a long time.
>>
