kafka-users mailing list archives

From Mayuresh Gharat <gharatmayures...@gmail.com>
Subject Re: Controller sometimes lose ISR
Date Thu, 05 Nov 2015 23:31:32 GMT
I am not sure about this; it might be related to your GC settings. What I
cannot explain is why it occurs only on Friday nights.
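One way to check (a sketch, assuming the Java 7/8-era GC logging flags and the
Kafka 0.8.x default zookeeper.session.timeout.ms of 6000 ms; the file paths and
sample log lines below are illustrative) is to enable safepoint/pause logging
and filter for pauses long enough to expire the ZooKeeper session:

```shell
# Hypothetical broker setting (Kafka's run scripts read KAFKA_GC_LOG_OPTS):
# KAFKA_GC_LOG_OPTS="-verbose:gc -Xloggc:/var/log/kafka/gc.log \
#   -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
#
# Then filter stop-the-world pauses longer than the 6 s session timeout.
# The here-doc stands in for a real gc.log:
awk -F'stopped: ' '/Total time for which application threads were stopped/ {
  split($2, a, " ");            # a[1] is the pause duration in seconds
  if (a[1] + 0 > 6.0) print $0; # longer than zookeeper.session.timeout.ms
}' <<'EOF'
2015-10-30T23:02:20.001+0000: Total time for which application threads were stopped: 0.0421 seconds
2015-10-30T23:02:30.998+0000: Total time for which application threads were stopped: 9.8123 seconds
EOF
```

Any hit there lines up with a "zookeeper state changed (Expired)" event; if
nothing shows up, the expiry is more likely network or ZooKeeper-side.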

Thanks,

Mayuresh

On Tue, Nov 3, 2015 at 3:01 AM, Gleb Zhukov <gzhukov@iponweb.net> wrote:

> Hi, Mayuresh. No, this log is from before broker 61 was restarted.
> But I found some interesting ZooKeeper-related log entries on the problem
> broker:
>
> root@kafka3d:~# zgrep 'zookeeper state changed (Expired)'
> /var/log/kafka/*/*
> /var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz:[2015-10-30
> 23:02:31,001] 284371992 [main-EventThread] INFO
>  org.I0Itec.zkclient.ZkClient  - zookeeper state changed (Expired)
>
> root@kafka3d:~# zgrep -i shut
> /var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz
> [2015-10-30 23:02:31,004] 284371995 [main-EventThread] INFO
>  org.apache.zookeeper.ClientCnxn  - EventThread shut down
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> [2015-10-30 23:10:22,206] 284843197 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-77], Shutting
> down
> [2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-77], Shutdown
> completed
> [2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-1-77], Shutting
> down
> [2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-1-77], Shutdown
> completed
> [2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-2-77], Shutting
> down
> [2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-2-77], Shutdown
> completed
> [2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-3-77], Shutting
> down
> [2015-10-30 23:10:22,439] 284843430 [kafka-request-handler-2] INFO
>  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-3-77], Shutdown
> completed
>
> Is it related to my GC settings?
>
> KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:InitiatingHeapOccupancyPercent=35"
> KAFKA_HEAP_OPTS="-Xmx8G"
>
> I also attached some GC graphs from JMX.
>
>
>
> On Tue, Nov 3, 2015 at 1:21 AM, Mayuresh Gharat <
> gharatmayuresh15@gmail.com> wrote:
>
>> Broker 61 somehow falls behind in fetching from the leader brokers and
>> hence falls out of the ISR.
>>
>> [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state
>> change of replica 61 for partition [test-res-met.server_logs.conv,18] from
>> OnlineReplica to OfflineReplica...
>> means that the current controller underwent a soft failure and came back
>> up, but another controller was elected in the meantime. The old controller
>> will eventually resign.
>> Is this log from after you rebounced 61?
>>
>>
>> Thanks,
>>
>> Mayuresh
>>
>> On Sat, Oct 31, 2015 at 5:09 AM, Gleb Zhukov <gzhukov@iponweb.net> wrote:
>>
>> > Hi, Everybody!
>> >
>> > Every week on Friday night I lose the ISR for some partitions in my
>> > Kafka cluster:
>> >
>> > Topic: test-res-met.server_logs.conv  Partition: 18    Leader: 45
>> > Replicas: 45,61    Isr: 45
>> > Current controller: 45
>> > Partitions with leader #61 remain available; I lose broker #61 from the
>> > ISR only for partitions with another leader.
>> >
>> > State logs on broker 61:
>> >
>> > [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state
>> > change of replica 61 for partition [test-res-met.server_logs.conv,18]
>> > from OnlineReplica to OfflineReplica failed (state.change.logger)
>> > kafka.common.StateChangeFailedException: Leader and isr path written by
>> > another controller. This probably means the current controller with epoch
>> > 2233 went through a soft failure and another controller was elected with
>> > epoch 2234. Aborting state change by this controller
>> >         at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
>> >         at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
>> >         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>> >         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>> >         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
>> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >         at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
>> >         at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>> >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>> >         at kafka.utils.Utils$.inLock(Utils.scala:535)
>> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
>> >         at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
>> >         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>> >
>> > Restarting the bad broker (#61) helps.
>> > We have 7-day retention for our logs (log.retention.hours=168). I also
>> > checked ZK and cron. Could anyone explain this issue? Kafka 0.8.2.1.
>> >
>>
>>
>>
>> --
>> -Regards,
>> Mayuresh R. Gharat
>> (862) 250-7125
>>
>
>
>
> --
> Best regards,
> Gleb Zhukov
> IPONWEB
>
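As a side note, the timing in your log excerpts can be checked directly: the
session expired at 23:02:31 but the replica fetchers only began shutting down
at 23:10:22. A minimal Python sketch (the log lines are copied from the broker
61 excerpts above, trimmed for illustration) that measures this gap:

```python
from datetime import datetime

# Log lines quoted earlier in this thread, trimmed to the relevant parts.
LOG_LINES = [
    "[2015-10-30 23:02:31,001] zookeeper state changed (Expired)",
    "[2015-10-30 23:10:22,206] [ReplicaFetcherThread-0-77], Shutting down",
]

def parse_ts(line: str) -> datetime:
    """Extract the leading '[YYYY-MM-DD HH:MM:SS,mmm]' log4j timestamp."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

expired, shutdown = (parse_ts(line) for line in LOG_LINES)
gap = (shutdown - expired).total_seconds()
print(f"fetchers began shutting down {gap:.1f}s after session expiry")
# prints: fetchers began shutting down 471.2s after session expiry
```

Nearly eight minutes between the expiry and the fetcher shutdown suggests the
broker stayed wedged well past any single GC pause, which fits the
soft-failure picture rather than a one-off pause.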



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125
