spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benyi Wang <bewang.t...@gmail.com>
Subject Re: How should I do to solve this problem that the executors ofmyspark application always is blocked after an executor is lost?
Date Mon, 17 Aug 2015 17:54:10 GMT
You'd better check YARN node manager's logs. Log on the data node where the
executor is lost, find the log file. If you use cloudera manager, you can
find the log YARN->NodeManager->Instance, there is a link "Log File."

I usually found that the container were killed by YARN because the memory
exceeded the YARN container limitation. If I find an executor is lost, I
simply kill the yarn application, and use more memory or use more tasks for
shuffle operations. It seems to waste time if let Spark recover from this
issue. It highly likely for spark to fail again.

You used --executor-memory 3G and --executor-cores 4, each core have < 1GB,
which sounds too small. Depends on your application, you can increase
--executor-memory or decrease --executor-cores to make each core have more
memory. Or you need to increase parallelism so that the job is chopped into
a smaller size.

This is one of annoying issues of Spark and YARN. YARN kills a container,
but Spark doesn't know about it and tries again, and got killed again. YARN
should provide better information about such a killing.

On Mon, Aug 17, 2015 at 1:51 AM, Saisai Shao <sai.sai.shao@gmail.com> wrote:

> I think this stack is the result, not the reason, this exception is caused
> by executor lost when trying to connect to executor which has already
> failed. Would you please dig out to see if there's any other exceptions?
>
> Also I'm conservative to the solution of changing to nio, I don't think it
> is the Netty block transfer service which lead to such problem.
>
> Thanks
> Saisai
>
> On Mon, Aug 17, 2015 at 4:34 PM, 刚 <949294472@qq.com> wrote:
>
>> Hi Saisai:
>>     Thank you for your reply.
>>     I find the log of the lost executor, and find the stack as follows:
>>     15/08/16 13:47:19 ERROR shuffle.OneForOneBlockFetcher: Failed while
>> starting block fetches
>> java.io.IOException: Connection reset by peer
>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>         at
>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>>         at
>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>>         at
>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
>>         at
>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>>         at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>         at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>         at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>         at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>>         at java.lang.Thread.run(Thread.java:745)
>> 15/08/16 13:47:19 INFO shuffle.RetryingBlockFetche
>>
>>     And I find how to solve this problem in this link :
>> http://search-hadoop.com/m/q3RTtss52fWIdgL1
>>     I have lauched my application by setting spark.shuffle.blockTransferService=nio
>> to see whether I have solved it.
>>
>>
>>     I have a question that the stack was printed at 15/08/16 13:47:19,
>> but the exector is lost at 15/08/17 00:32:37. I am not certain whether
>> there is any relationship between the stack and the blocking of the
>> application.
>>
>>
>>
>> ------------------ 原始邮件 ------------------
>> *发件人:* "Saisai Shao";<sai.sai.shao@gmail.com>;
>> *发送时间:* 2015年8月17日(星期一) 中午1:47
>> *收件人:* "刚"<949294472@qq.com>;
>> *抄送:* "user"<user@spark.apache.org>;
>> *主题:* Re: How should I do to solve this problem that the executors
>> ofmyspark application always is blocked after an executor is lost?
>>
>> Here is a link in executor page (stderr), you could click and see the
>> logs:
>>
>>
>> Also for the failed executors, as far as I know you could log in to the
>> node where the executor runs and go to the log dir of Yarn, there's a
>> "userlogs" folder, you could find the log of executors you want.
>>
>> For the Yarn node manager log, you could also find out in Yarn log dir.
>> For HDP it would be /var/log/hadoop-yarn/yarn by default.
>>
>> Thanks
>> Saisai
>> ​
>>
>> On Mon, Aug 17, 2015 at 1:27 PM, 刚 <949294472@qq.com> wrote:
>>
>>>
>>>  Hi Saisai:
>>>     Thank you for your reply. How can I get the log of executors and
>>> node manager except the yarn log that I have got by using the command "yarn
>>> logs --applicationId"?
>>>
>>> ------------------ 原始邮件 ------------------
>>> *发件人:* "Saisai Shao";<sai.sai.shao@gmail.com>;
>>> *发送时间:* 2015年8月17日(星期一) 中午12:47
>>> *收件人:* "刚"<949294472@qq.com>;
>>> *抄送:* "user"<user@spark.apache.org>;
>>> *主题:* Re: How should I do to solve this problem that the executors of
>>> myspark application always is blocked after an executor is lost?
>>>
>>> Hi,
>>>
>>> I think you need to find some clues in the log of executors, as well as
>>> node manager to dig out more details and exception stacks, so that we could
>>> get a clear picture of what caused such problems.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Mon, Aug 17, 2015 at 11:35 AM, 刚 <949294472@qq.com> wrote:
>>>
>>>> Hi guys:
>>>>     I run 9 applications in my spark-cluster at the same time. They all
>>>> run well in the beginning. But after several hours, some applications lost
>>>> one executor, and other executors are blocked. By the way, I am using
>>>> spark-streaming to analysis real-time messages. The screenshots are as
>>>> follows.
>>>>                                                          Figure1: The
>>>> stage has lasted for long time after one executor is lost
>>>>
>>>>
>>>>                                                   Figure2:The task info
>>>> of the stage that has last for long time after one executor is lost
>>>>
>>>> The command that I submit an application is as follows:
>>>> spark-submit --class spark_security.login_users.Sockpuppet
>>>>  --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores
>>>> 4  --name pcLoginSparkDealerUser --master yarn  --deploy-mode cluster
>>>>  spark_Security-1.0-SNAPSHOT.jar
>>>> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties
>>>>
>>>> Another 8 applications are submitted to use the same driver-memory,
>>>> executor-memory, num-executors, executor-cores. And they are all run in
>>>> cluster mode.
>>>>
>>>> When the problem happens, I got the yarn logs use the  command as
>>>> follows:
>>>>
>>>> yarn logs -application application_1439457182724_0026
>>>>
>>>> I can not find any stack of exception. But I find the information as
>>>> follows:
>>>> 15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint
>>>> for time 1439472653000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
>>>> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
>>>> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting
>>>> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
>>>> 15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time
>>>> 1439472653000 ms saved to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000
>>>> /regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000',
>>>> took 473939 bytes and 65 ms
>>>> 15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from
>>>> remote. Handshake timed out or transport failure detector triggered.
>>>> 15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5
>>>> on A01-R08-2-I160-115.JD.LOCAL: remote Akka client disassociated
>>>> 15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association
>>>> with remote system [akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922]
>>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>> 15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for
>>>> 3 from TaskSet 3719.0
>>>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms
>>>> is invalid as zeroTime is 1439457657000 ms and slideDuration is 15000 ms
>>>> and difference is 14997000 ms
>>>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms
>>>> is invalid as zeroTime is 1439457657000 ms and slideDuration is 45000 ms
>>>> and difference is 14997000 ms
>>>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms
>>>> is invalid as zeroTime is 1439457657000 ms and slideDuration is 60000 ms
>>>> and difference is 14997000 ms
>>>> 15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms
>>>> is invalid as zeroTime is 1439457657000 ms and slideDuration is 120000 ms
>>>> and difference is 14997000 ms
>>>> 15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time
>>>> 1439472654000 ms
>>>> 15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for
>>>> time 1439472654000 ms
>>>> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data
>>>> for time 1439472654000 ms
>>>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in
>>>> stage 3719.0 (TID 707634, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>>>> (executor 5 lost)
>>>> 15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data
>>>> for time 1439472654000 ms
>>>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in
>>>> stage 3719.0 (TID 707625, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>>>> (executor 5 lost)
>>>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in
>>>> stage 3719.0 (TID 707628, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>>>> (executor 5 lost)
>>>> 15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in
>>>> stage 3719.0 (TID 707631, A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure
>>>> (executor 5 lost)
>>>> 15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch
>>>> 930)
>>>> 15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to
>>>> remove executor 3 from BlockManagerMaster.
>>>> 15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3
>>>> successfully in removeExecutor
>>>> 15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable
>>>> on executor 3 (111/180, false)
>>>> 15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint
>>>> for time 1439472654000 ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:90
>>>> 00/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'
>>>>
>>>> Some one says that it is caused by OOM, but I can not find any stack of
>>>> OOM.
>>>>
>>>> I set the spark-defaults.con as follows:
>>>> spark.core.connection.ack.wait.timeout  3600
>>>> spark.core.connection.auth.wait.timeout 3600
>>>> spark.akka.frameSize                    1024
>>>> spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0–2041
>>>> spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0–2041
>>>> spark.akka.timeout                      900
>>>> spark.storage.memoryFraction            0.4
>>>> spark.rdd.compress
>>>>
>>>> It is very appreciated that anyone can tell me how to solve this
>>>> problem. It has botherd me for a long time.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message