spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: One of the executor not getting StopExecutor message
Date Wed, 04 Mar 2015 06:38:54 GMT
Not quite sure, but you can try increasing spark.akka.threads; it could
also be a YARN-related issue.
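
For reference, a minimal sketch of setting it (assuming a SparkConf-based
setup; the value 8 is purely illustrative, the 1.x default is 4):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: raise spark.akka.threads above its default of 4.
    // The value 8 is illustrative, not a recommendation.
    val conf = new SparkConf()
      .setAppName("your-app")            // hypothetical app name
      .set("spark.akka.threads", "8")
    val sc = new SparkContext(conf)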

Thanks
Best Regards

On Tue, Mar 3, 2015 at 3:38 PM, twinkle sachdeva <twinkle.sachdeva@gmail.com> wrote:

> Hi,
>
> The operations are not very extensive, and this scenario is not always
> reproducible.
> One of the executors starts behaving in this manner. For this particular
> application, we are using 8 cores per executor, and in practice, 4
> executors are launched on one machine.
>
> This machine has a good configuration with respect to the number of cores.
>
> To me it seems to be some Akka communication issue. If I try to take a
> thread dump of the executor once it appears to be in trouble, the request
> times out.
>
> Can it be something related to spark.akka.threads?
>
>
>
> On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
>
>> Most likely that particular executor is stuck in a GC pause. What
>> operation are you performing? You can try increasing the parallelism if
>> you see that only one executor is doing the work.
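>>
>> For example, a rough sketch (myRdd and the multiplier are hypothetical;
>> tune them for your data):
>>
>>     // Spread the stage across more tasks so a single executor is not
>>     // doing all the work; 4x the default parallelism is only an example.
>>     val widened = myRdd.repartition(sc.defaultParallelism * 4)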
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <twinkle.sachdeva@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a Spark application on YARN in cluster mode.
>>> One of my executors appears to be in a hung state for a long time, and
>>> is finally killed by the driver.
>>>
>>> Unlike the other executors, it has not received the StopExecutor
>>> message from the driver.
>>>
>>> Here are the logs at the end of this container (C_1):
>>>
>>> --------------------------------------------------------------------------------
>>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
>>> broadcast 36, response is 2
>>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
>>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection
>>> to TMO-GCR70/192.168.162.70:9000 from admin: closed
>>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection
>>> to TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining
>>> connections 0
>>> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>>> with renew id 1 executed
>>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>>> with renew id 1 expired
>>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>>> with renew id 1 exited
>>> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
>>> SIGNAL 15: SIGTERM
>>>
>>> NOTE that it produced no logs for more than 2 hours.
>>>
>>> Here are the logs at the end of a normal container (C_2):
>>>
>>> ------------------------------------------------------------------------------------
>>> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
>>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
>>> received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
>>> :37906/user/CoarseGrainedScheduler#160899257]
>>> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
>>> commanded a shutdown
>>> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
>>> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
>>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] handled
>>> message (181.499835 ms) StopExecutor from
>>> Actor[akka.tcp://sparkDriver@TMO-DN73
>>> :37906/user/CoarseGrainedScheduler#160899257]
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>>> Shutting down remote daemon.
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>>> Remote daemon shut down; proceeding with flushing remote transports.
>>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>>> Remoting shut down.
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>>> org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>>> org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
>>> org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
>>> more references remain: org.apache.hadoop.ipc.Client@76a68bd4
>>> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
>>> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
>>> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>>>
>>> On the driver side, I can see logs for heartbeat messages from C_1
>>> until 20:05:00:
>>>
>>> ------------------------------------------------------------------------------------------
>>> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received
>>> message Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7,
>>> TMO-DN73, 34106)) from Actor[akka.tcp://sparkExecutor@TMO-DN73
>>> :43671/temp/$fn]
>>>
>>> After this, the driver continues to receive heartbeats from the other
>>> executors, but not from this one. The following message is responsible
>>> for its SIGTERM:
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------
>>>
>>> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
>>> BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
>>> 80515ms exceeds 45000ms
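>>>
>>> As a stopgap while debugging, I am considering raising that 45000ms
>>> threshold. A sketch, assuming the 1.2 property name for the block
>>> manager slave timeout is spark.storage.blockManagerSlaveTimeoutMs
>>> (conf here is the application's SparkConf):
>>>
>>>     // Assumption: this is the timeout behind "exceeds 45000ms" above.
>>>     // 120000 ms is an arbitrary illustrative value.
>>>     conf.set("spark.storage.blockManagerSlaveTimeoutMs", "120000")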
>>>
>>>
>>> I am using Spark 1.2.1.
>>>
>>> Any pointers?
>>>
>>>
>>> Thanks,
>>>
>>> Twinkle
>>>
>>
>>
>
