spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "刚" <949294...@qq.com>
Subject 回复: How should I do to solve this problem that the executorsofmyspark application always is blocked after an executor is lost?
Date Tue, 18 Aug 2015 02:54:00 GMT
Hi Saisai:
    My application has runned for 17 hours. It has not been blocked. But 2 executors have
lost. I can not find the stack that cause the executor to exit from the executor log. The
log configuration in yarn-site.xml is as follows.
 <property>
   <name>yarn.nodemanager.log.retain-seconds</name>
   <value>36000</value>
   <description>Default time (in seconds) to retain log files on the NodeManager Only
applicable if log-aggregation is disabled.</description>
  </property>


  <property>
   <name>yarn.log-aggregation.retain-seconds</name>
   <value>864000</value>
   <description>在HDFS上聚集日志保存的时间,10day=864000 seconds</description>
  </property>
  
  <property>
   <name>yarn.log-aggregation.retain-check-interval-seconds</name>
   <value>-1</value>
  </property>


  <property>
   <name>yarn.nodemanager.remote-app-log-dir</name>
   <value>/tmp/logs</value>
  </property>


  <property>
   <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
   <value>logs</value>
  </property>


  <property>
     <name>yarn.log-aggregation-enable</name>
     <value>true</value>
     <description>是否启用日志聚集功能,设置成true</description>
  </property>	
 
  <property>
     <name>yarn.nodemanager.log-dirs</name>
     <value>/export/servers/hadoop2.6.0/yarn_logs</value>
     <description>yarn node 日志存放地址</description>
  </property>



   My hadoop home directory is /export/servers/hadoop2.6.0
   The node that executor has lost is 172.19.160.116. So I got the logs from this directory:
/export/servers/hadoop2.6.0/yarn_logs/application_1439803298368_0003/container_1439803298368_0003_01_000002/stderr
There in no stack in this file.
From the yarn-site.xml configuration, I think this file should be  the yarn nodemanager log.


And I can not find "userlogs" directory in "/export/servers/hadoop2.6.0/logs" directory. 
How can I find the logs of executor that has lost? What is the difference between yarn node
manager logs and executor logs? Maybe, I think I could find the stack in executor logs. 
    
 


   
    




------------------ 原始邮件 ------------------
发件人: "Saisai Shao";<sai.sai.shao@gmail.com>;
发送时间: 2015年8月17日(星期一) 下午4:51
收件人: "刚"<949294472@qq.com>; 
抄送: "user"<user@spark.apache.org>; 
主题: Re: How should I do to solve this problem that the executorsofmyspark application
always is blocked after an executor is lost?



I think this stack is the result, not the reason, this exception is caused by executor lost
when trying to connect to executor which has already failed. Would you please dig out to see
if there's any other exceptions?

Also I'm conservative to the solution of changing to nio, I don't think it is the Netty block
transfer service which lead to such problem.


Thanks
Saisai 


On Mon, Aug 17, 2015 at 4:34 PM, 刚 <949294472@qq.com> wrote:
Hi Saisai:
    Thank you for your reply.
    I find the log of the lost executor, and find the stack as follows:
    15/08/16 13:47:19 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at java.lang.Thread.run(Thread.java:745)
15/08/16 13:47:19 INFO shuffle.RetryingBlockFetche


    And I find how to solve this problem in this link :
http://search-hadoop.com/m/q3RTtss52fWIdgL1
    I have lauched my application by setting spark.shuffle.blockTransferService=nio to see
whether I have solved it.


    
    I have a question that the stack was printed at 15/08/16 13:47:19, but the exector is
lost at 15/08/17 00:32:37. I am not certain whether there is any relationship between the
stack and the blocking of the application.
    
    


------------------ 原始邮件 ------------------
发件人: "Saisai Shao";<sai.sai.shao@gmail.com>;
发送时间: 2015年8月17日(星期一) 中午1:47
收件人: "刚"<949294472@qq.com>; 
抄送: "user"<user@spark.apache.org>; 
主题: Re: How should I do to solve this problem that the executors ofmyspark application
always is blocked after an executor is lost?



Here is a link in executor page (stderr), you could click and see the logs:





Also for the failed executors, as far as I know you could log in to the node where the executor
runs and go to the log dir of Yarn, there's a "userlogs" folder, you could find the log of
executors you want.


For the Yarn node manager log, you could also find out in Yarn log dir. For HDP it would be
/var/log/hadoop-yarn/yarn by default.


Thanks
Saisai
​




On Mon, Aug 17, 2015 at 1:27 PM, 刚 <949294472@qq.com> wrote:


 Hi Saisai:
    Thank you for your reply. How can I get the log of executors and node manager except the
yarn log that I have got by using the command "yarn logs --applicationId"? 


------------------ 原始邮件 ------------------
发件人: "Saisai Shao";<sai.sai.shao@gmail.com>;
发送时间: 2015年8月17日(星期一) 中午12:47
收件人: "刚"<949294472@qq.com>; 
抄送: "user"<user@spark.apache.org>; 
主题: Re: How should I do to solve this problem that the executors of myspark application
always is blocked after an executor is lost?



Hi,

I think you need to find some clues in the log of executors, as well as node manager to dig
out more details and exception stacks, so that we could get a clear picture of what caused
such problems.


Thanks
Saisai


On Mon, Aug 17, 2015 at 11:35 AM, 刚 <949294472@qq.com> wrote:
Hi guys:    I run 9 applications in my spark-cluster at the same time. They all run well in
the beginning. But after several hours, some applications lost one executor, and other executors
are blocked. By the way, I am using spark-streaming to analysis real-time messages. The screenshots
are as follows.

                                                         Figure1: The stage has lasted for
long time after one executor is lost





                                                  Figure2:The task info of the stage that
has last for long time after one executor is lost


The command that I submit an application is as follows:
spark-submit --class spark_security.login_users.Sockpuppet  --driver-memory 3g --executor-memory
3g --num-executors 3 --executor-cores 4  --name pcLoginSparkDealerUser --master yarn  --deploy-mode
cluster  spark_Security-1.0-SNAPSHOT.jar hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties

     
Another 8 applications are submitted to use the same driver-memory, executor-memory, num-executors,
executor-cores. And they are all run in cluster mode.


When the problem happens, I got the yarn logs use the  command as follows:


yarn logs -application application_1439457182724_0026


I can not find any stack of exception. But I find the information as follows:
15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472653000
ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time 1439472653000 ms saved
to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000',
took 473939 bytes and 65 ms
15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from remote. Handshake timed
out or transport failure detector triggered.
15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5 on A01-R08-2-I160-115.JD.LOCAL:
remote Akka client disassociated
15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922] has failed, address is now gated
for [5000] ms. Reason is: [Disassociated].
15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for 3 from TaskSet 3719.0
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 15000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 45000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 60000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 120000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time 1439472654000 ms
15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for time 1439472654000
ms
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data for time 1439472654000
ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 3719.0 (TID 707634,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data for time 1439472654000
ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 3719.0 (TID 707625,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3719.0 (TID 707628,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in stage 3719.0 (TID 707631,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 930)
15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to remove executor 3 from
BlockManagerMaster.
15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable on executor 3 (111/180,
false)
15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472654000
ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'



Some one says that it is caused by OOM, but I can not find any stack of OOM.


I set the spark-defaults.con as follows:
spark.core.connection.ack.wait.timeout  3600
spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                    1024
spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0–2041
spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0–2041
spark.akka.timeout                      900
spark.storage.memoryFraction            0.4
spark.rdd.compress      



It is very appreciated that anyone can tell me how to solve this problem. It has botherd me
for a long time.
Mime
View raw message