kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: leaders being kicked out of ISR
Date Thu, 17 Apr 2014 04:00:16 GMT
Any errors in the controller/state-change logs? Do you have GC-induced ZK
session expiration on the brokers?
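
As a rough way to check for the latter, something like the sketch below could
scan a broker's GC log for stop-the-world pauses longer than the ZK session
timeout (this assumes the broker runs with -XX:+PrintGCApplicationStoppedTime;
the 6-second timeout and the log path are only placeholders):

import re
import sys

# Assumed default zookeeper.session.timeout.ms of 6000 ms; adjust to
# whatever the brokers actually use.
SESSION_TIMEOUT_SECS = 6.0

# Matches lines emitted with -XX:+PrintGCApplicationStoppedTime, e.g.
# "Total time for which application threads were stopped: 7.1234567 seconds"
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds")

def long_pauses(gc_log_path):
    """Yield GC log lines whose pause exceeds the ZK session timeout."""
    with open(gc_log_path) as f:
        for line in f:
            m = PAUSE_RE.search(line)
            if m and float(m.group(1)) > SESSION_TIMEOUT_SECS:
                yield line.rstrip()

if __name__ == "__main__":
    # e.g. python gc_pause_check.py /path/to/kafka-gc.log (hypothetical path)
    for line in long_pauses(sys.argv[1]):
        print(line)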

Thanks,

Jun


On Wed, Apr 16, 2014 at 9:50 AM, Xinyao Hu <xinyaohu@dropbox.com> wrote:

> Hey all,
>
> This is Xinyao from Dropbox, and I am trying to get Kafka widely used
> inside Dropbox.
>
> I set up five nodes with 2x replication. The system is handling all the
> traffic well. The problem I am seeing now is that one host (probably a
> leader) is being kicked out of the ISR. Here is some related information.
>
> [2014-04-16 15:31:09,596] INFO Closing socket connection to /10.12.99.184.
> (kafka.network.Processor)
> [2014-04-16 15:31:09,921] INFO Closing socket connection to /10.12.99.184.
> (kafka.network.Processor)
> [2014-04-16 15:31:13,142] INFO Closing socket connection to /10.12.99.153.
> (kafka.network.Processor)
> [2014-04-16 15:31:24,101] INFO Closing socket connection to /10.12.62.181.
> (kafka.network.Processor)
> [2014-04-16 15:31:27,157] INFO Rolled new log segment for
> 'photos-mobile_request_user0_logs-0003-4' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:30,709] INFO Rolled new log segment for
> 'api-oauth-unnecessary-token-with-nonempty-secret-3' in 1 ms.
> (kafka.log.Log)
> [2014-04-16 15:31:32,310] INFO Rolled new log segment for
> 'api-files-post-unnecessary-params-3' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:42,150] INFO Closing socket connection to /10.12.99.184.
> (kafka.network.Processor)
> [2014-04-16 15:31:42,151] INFO Closing socket connection to /10.12.99.184.
> (kafka.network.Processor)
> [2014-04-16 15:31:47,557] INFO Closing socket connection to /10.16.105.60.
> (kafka.network.Processor)
> [2014-04-16 15:31:47,571] INFO Closing socket connection to /10.16.103.58.
> (kafka.network.Processor)
> [2014-04-16 15:31:47,632] INFO Closing socket connection to /10.16.54.38.
> (kafka.network.Processor)
> [2014-04-16 15:31:47,879] INFO Closing socket connection to /10.16.102.22.
> (kafka.network.Processor)
> [2014-04-16 15:31:54,954] INFO Rolled new log segment for
> 'api-not_your_cursor-2' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
> (kafka.network.Processor)
> [2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
> (kafka.network.Processor)
> [2014-04-16 15:31:57,136] INFO Closing socket connection to /10.12.62.181.
> (kafka.network.Processor)
> [2014-04-16 15:31:57,137] INFO Closing socket connection to /10.12.62.181.
> (kafka.network.Processor)
> [2014-04-16 15:32:05,540] INFO Partition
> [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Shrinking ISR for
> partition [client-fsinfo-from-formatted-traces-0012,0] from 1,2 to 1
> (kafka.cluster.Partition)
> [2014-04-16 15:32:05,579] ERROR Conditional update of path
> /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
> with data
> {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
> and expected
> version 0 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for
> /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
> (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,581] INFO Partition
> [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Cached zkVersion
> [0] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2014-04-16 15:32:05,582] INFO Partition [api-shmodel-non-ssl,4] on broker
> 1: Shrinking ISR for partition [api-shmodel-non-ssl,4] from 1,2 to 1
> (kafka.cluster.Partition)
> [2014-04-16 15:32:05,596] ERROR Conditional update of path
> /brokers/topics/api-shmodel-non-ssl/partitions/4/state with data
> {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
> and expected version 0 failed due
> to org.apache.zookeeper.KeeperException$BadVersionException:
> KeeperErrorCode = BadVersion for
> /brokers/topics/api-shmodel-non-ssl/partitions/4/state
> (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,596] INFO Partition [api-shmodel-non-ssl,4] on broker
> 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2014-04-16 15:32:05,596] INFO Partition
> [server-network_logs_errors-0012,0] on broker 1: Shrinking ISR for
> partition [server-network_logs_errors-0012,0] from 1,2 to 1
> (kafka.cluster.Partition)
> [2014-04-16 15:32:05,605] ERROR Conditional update of path
> /brokers/topics/server-network_logs_errors-0012/partitions/0/state with
> data
> {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
> and expected version 0
>  failed due to org.apache.zookeeper.KeeperException$BadVersionException:
> KeeperErrorCode = BadVersion for
> /brokers/topics/server-network_logs_errors-0012/partitions/0/state
> (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,605] INFO Partition
> [server-network_logs_errors-0012,0] on broker 1: Cached zkVersion [0] not
> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> You can see that something went wrong around 15:31. This is the current
> topic replica information:
>
> Topic:analytics-ab_raw_metric PartitionCount:5 ReplicationFactor:2 Configs:
> Topic: analytics-ab_raw_metric Partition: 0 Leader: 3 Replicas: 3,5 Isr:
> 3,5
> Topic: analytics-ab_raw_metric Partition: 1 Leader: 4 Replicas: 4,1 Isr:
> 1,4
> Topic: analytics-ab_raw_metric Partition: 2 Leader: 5 Replicas: 5,2 Isr:
> 5,2
> Topic: analytics-ab_raw_metric Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
> Topic: analytics-ab_raw_metric Partition: 4 Leader: 2 Replicas: 2,4 Isr:
> 2,4
> Topic:analytics-api_log2-0000 PartitionCount:5 ReplicationFactor:2 Configs:
> Topic: analytics-api_log2-0000 Partition: 0 Leader: 3 Replicas: 3,5 Isr:
> 3,5
> Topic: analytics-api_log2-0000 Partition: 1 Leader: 4 Replicas: 4,1 Isr:
> 1,4
> Topic: analytics-api_log2-0000 Partition: 2 Leader: 5 Replicas: 5,2 Isr:
> 5,2
> Topic: analytics-api_log2-0000 Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
> Topic: analytics-api_log2-0000 Partition: 4 Leader: 2 Replicas: 2,4 Isr:
> 2,4
> Topic:analytics-api_log2-0001 PartitionCount:5 ReplicationFactor:2 Configs:
> Topic: analytics-api_log2-0001 Partition: 0 Leader: 3 Replicas: 1,3 Isr: 3
> Topic: analytics-api_log2-0001 Partition: 1 Leader: 2 Replicas: 2,4 Isr:
> 2,4
> Topic: analytics-api_log2-0001 Partition: 2 Leader: 3 Replicas: 3,5 Isr:
> 3,5
> Topic: analytics-api_log2-0001 Partition: 3 Leader: 4 Replicas: 4,1 Isr:
> 1,4
> Topic: analytics-api_log2-0001 Partition: 4 Leader: 5 Replicas: 5,2 Isr:
> 5,2
> Topic:analytics-api_log2-0002 PartitionCount:5 ReplicationFactor:2 Configs:
> Topic: analytics-api_log2-0002 Partition: 0 Leader: 5 Replicas: 5,1 Isr:
> 5,1
> Topic: analytics-api_log2-0002 Partition: 1 Leader: 2 Replicas: 1,2 Isr: 2
> Topic: analytics-api_log2-0002 Partition: 2 Leader: 2 Replicas: 2,3 Isr:
> 2,3
> Topic: analytics-api_log2-0002 Partition: 3 Leader: 3 Replicas: 3,4 Isr:
> 3,4
> Topic: analytics-api_log2-0002 Partition: 4 Leader: 4 Replicas: 4,5 Isr:
> 5,4
> Topic:analytics-api_log2-0003 PartitionCount:5 ReplicationFactor:2 Configs:
> Topic: analytics-api_log2-0003 Partition: 0 Leader: 5 Replicas: 5,4 Isr:
> 5,4
> Topic: analytics-api_log2-0003 Partition: 1 Leader: 5 Replicas: 1,5 Isr: 5
> Topic: analytics-api_log2-0003 Partition: 2 Leader: 2 Replicas: 2,1 Isr:
> 2,1
> Topic: analytics-api_log2-0003 Partition: 3 Leader: 3 Replicas: 3,2 Isr:
> 3,2
> Topic: analytics-api_log2-0003 Partition: 4 Leader: 4 Replicas: 4,3 Isr:
> 3,4
>
> ....
>
> I suspect that broker 1 lost leadership for all the topics around that
> timestamp, but it is still in sync as a follower.
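>
> To double-check this, something along the lines of the rough sketch below
> could parse kafka-topics.sh --describe output (the same format as the dump
> above) and list every partition where a replica has dropped out of the ISR;
> the script and its column parsing are just a sketch:
>
> import re
> import sys
>
> # Matches partition lines such as:
> #   Topic: analytics-ab_raw_metric Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
> LINE_RE = re.compile(
>     r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+"
>     r"Replicas:\s*(\S+)\s+Isr:\s*(\S*)")
>
> def out_of_sync(describe_output):
>     """Yield (broker, topic, partition, leader) for replicas missing from the ISR."""
>     for line in describe_output.splitlines():
>         m = LINE_RE.search(line)
>         if not m:
>             continue
>         topic, partition, leader, replicas, isr = m.groups()
>         missing = set(replicas.split(",")) - set(isr.split(",") if isr else [])
>         for broker in sorted(missing):
>             yield broker, topic, int(partition), leader
>
> if __name__ == "__main__":
>     # e.g. kafka-topics.sh --describe --zookeeper zk:2181 | python isr_check.py
>     for broker, topic, partition, leader in out_of_sync(sys.stdin.read()):
>         print("broker %s not in ISR for %s-%d (leader %s)"
>               % (broker, topic, partition, leader))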
>
> I followed "How to reduce churn" and increased replica.lag.max.messages.
> However, I don't think it helps, since I suspect it is the leader that is
> being kicked out of the ISR.
>
> I cannot reproduce this consistently; it has happened twice, once a day over
> the past two days. These are all the configs I changed in
> server.properties:
>
> default.replication.factor=2
>
> auto.leader.rebalance.enable=true
>
> log.roll.hours=1
>
> replica.lag.max.messages=50000
>
> Let me know if you need more logs to debug.
>
