spark-user mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: How should I do to solve this problem that the executors of my spark application always is blocked after an executor is lost?
Date Mon, 17 Aug 2015 04:47:44 GMT
Hi,

I think you need to look for clues in the executor logs, as well as the
NodeManager logs, to dig out more details and exception stacks, so that we
can get a clearer picture of what caused the problem.
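As a minimal sketch of that kind of log triage (assuming the aggregated logs
have already been dumped locally, e.g. with `yarn logs -applicationId <appId>
> app.log`; the file name and the patterns below are illustrative, not an
exhaustive list):

```python
import re
import sys
from collections import Counter

# Failure signatures that commonly accompany an executor loss: a JVM
# OutOfMemoryError, YARN killing a container for exceeding its memory
# limit, and Akka disassociation. The patterns are illustrative.
PATTERNS = {
    "jvm_oom": re.compile(r"java\.lang\.OutOfMemoryError"),
    "yarn_memory_kill": re.compile(r"beyond (physical|virtual) memory limits"),
    "disassociated": re.compile(r"disassociated", re.I),
    "executor_lost": re.compile(r"Lost executor \d+"),
}

def scan_log(lines):
    """Count each failure signature and keep the first matching line."""
    counts = Counter()
    samples = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                samples.setdefault(name, line.rstrip())
    return counts, samples

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py app.log
    # where app.log came from: yarn logs -applicationId <appId> > app.log
    with open(sys.argv[1]) as f:
        counts, samples = scan_log(f)
    for name, n in counts.most_common():
        print(f"{name}: {n} hit(s), first: {samples[name]}")
```

If none of these signatures appear in the aggregated logs, the NodeManager
log on the host that lost the executor is the next place to check.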

Thanks
Saisai

On Mon, Aug 17, 2015 at 11:35 AM, 刚 <949294472@qq.com> wrote:

> Hi guys:
>     I run 9 applications in my Spark cluster at the same time. They all
> run well at the beginning, but after several hours some applications lose
> an executor, and the remaining executors become blocked. By the way, I am
> using Spark Streaming to analyze real-time messages. The screenshots are
> as follows.
> Figure 1: The stage has lasted a long time after one executor is lost.
>
> Figure 2: Task info of the stage that has lasted a long time after one
> executor is lost.
>
> The command I use to submit an application is as follows:
> spark-submit --class spark_security.login_users.Sockpuppet
>  --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores
> 4  --name pcLoginSparkDealerUser --master yarn  --deploy-mode cluster
>  spark_Security-1.0-SNAPSHOT.jar
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties
>
> The other 8 applications are submitted with the same driver-memory,
> executor-memory, num-executors, and executor-cores settings, and they all
> run in cluster mode.
>
> When the problem happens, I fetch the YARN logs with the following command:
>
> yarn logs -applicationId application_1439457182724_0026
>
> I cannot find any exception stack, but I do find the following information:
> 15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint for
> time 1439472653000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time
> 1439472653000 ms saved to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000', took
> 473939 bytes and 65 ms
> 15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from
> remote. Handshake timed out or transport failure detector triggered.
> 15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5 on
> A01-R08-2-I160-115.JD.LOCAL: remote Akka client disassociated
> 15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association
> with remote system [akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922]
> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for 3
> from TaskSet 3719.0
> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
> invalid as zeroTime is 1439457657000 ms and slideDuration is 15000 ms and
> difference is 14997000 ms
> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
> invalid as zeroTime is 1439457657000 ms and slideDuration is 45000 ms and
> difference is 14997000 ms
> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
> invalid as zeroTime is 1439457657000 ms and slideDuration is 60000 ms and
> difference is 14997000 ms
> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is
> invalid as zeroTime is 1439457657000 ms and slideDuration is 120000 ms and
> difference is 14997000 ms
> 15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time
> 1439472654000 ms
> 15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for
> time 1439472654000 ms
> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data
> for time 1439472654000 ms
> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in stage
> 3719.0 (TID 707634, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
> (executor 5 lost)
> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data for
> time 1439472654000 ms
> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
> 3719.0 (TID 707625, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
> (executor 5 lost)
> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in stage
> 3719.0 (TID 707628, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
> (executor 5 lost)
> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in stage
> 3719.0 (TID 707631, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
> (executor 5 lost)
> 15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch
> 930)
> 15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 3 from BlockManagerMaster.
> 15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3 successfully
> in removeExecutor
> 15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable on
> executor 3 (111/180, false)
> 15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint for
> time 1439472654000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:90
> 00/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'
>
> Someone says that it is caused by OOM, but I cannot find any OOM stack
> trace in the logs.
>
> I set spark-defaults.conf as follows:
> spark.core.connection.ack.wait.timeout  3600
> spark.core.connection.auth.wait.timeout 3600
> spark.akka.frameSize                    1024
> spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0-2041
> spark.akka.timeout                      900
> spark.storage.memoryFraction            0.4
> spark.rdd.compress
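> For reference, the related Akka failure-detector settings (shown with
> their documented Spark 1.x defaults; a sketch of what could also be
> raised, not tested values) would look like:
>
> spark.akka.heartbeat.pauses             6000
> spark.akka.heartbeat.interval           1000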
>
> I would greatly appreciate it if anyone could tell me how to solve this
> problem. It has bothered me for a long time.
>
