kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: question about isr
Date Mon, 12 May 2014 14:49:35 GMT
Could you try 0.8.1.1? It fixed some bugs related to the controller hanging
during shutdown.

For ZK session expiration, the #1 cause is GC pauses. We have also seen
transient issues with the underlying storage (e.g., a RAID controller reset)
cause ZK session expiration.
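To illustrate the GC case (a hypothetical sketch, not ZooKeeper's actual code): the server expires a session when no heartbeat arrives within the session timeout, and a stop-the-world GC pause on the broker freezes the thread that sends heartbeats, so a pause longer than the timeout expires the session. The timeout value below matches the 0.8.x default for zookeeper.session.timeout.ms, but treat the numbers as illustrative.

```python
# Hypothetical sketch of why a long GC pause expires a ZK session.
# Not Kafka/ZooKeeper code -- just the timing argument.

ZK_SESSION_TIMEOUT_MS = 6000   # zookeeper.session.timeout.ms default in 0.8.x
HEARTBEAT_INTERVAL_MS = 2000   # client pings at roughly 1/3 of the timeout

def session_expired(pause_ms: int) -> bool:
    """True if a stop-the-world pause of pause_ms outlives the session.

    Worst case the pause starts just after a heartbeat, so the server
    waits at most the full session timeout before expiring the session.
    """
    return pause_ms > ZK_SESSION_TIMEOUT_MS

print(session_expired(500))    # brief minor-GC pause: session survives
print(session_expired(9000))   # long full-GC/CMS pause: session expires
```

This is also why raising zookeeper.session.timeout.ms is a common mitigation when GC pauses cannot be shortened.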

As for the data loss in the producer, you are probably using ack=1, which
can lead to data loss during leader failover. It's just that in this case,
the failover window lasts forever due to some bugs.
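A minimal sketch of why ack=1 can lose acknowledged messages (hypothetical code, not the Kafka implementation): the leader acknowledges as soon as the write lands locally, so any message the followers have not yet fetched when the leader dies is gone after the new leader is elected.

```python
# Hypothetical sketch of acks=1 data loss on leader failover.
# Logs are plain lists; "replicated" stands in for follower fetch timing.

def produce_with_ack1(leader_log, follower_log, msg, replicated):
    """Leader appends and acks immediately; replication may lag behind."""
    leader_log.append(msg)
    if replicated:                 # follower fetched the message in time
        follower_log.append(msg)
    return "ack"                   # acked either way -- that's acks=1

leader, follower = [], []
produce_with_ack1(leader, follower, "m1", replicated=True)
produce_with_ack1(leader, follower, "m2", replicated=False)  # still in flight

# Leader fails; the follower is elected leader. "m2" was acked but is lost.
new_leader = follower
print(new_leader)   # ['m1']
```

With acks=-1 (all) the producer would not get the ack for "m2" until the follower had it, closing this window at the cost of latency.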

Thanks,

Jun


On Sun, May 11, 2014 at 10:14 PM, Ian Friedman <ian@flurry.com> wrote:

> Jun - We're using 0.8.1
> The timestamps in the last few entries in controller.log seem to
> correspond to when the trouble started in server.log
>
> controller.log:
>
> [2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK
> expired; shut down all controller components and try to re-elect
> (kafka.controller.KafkaController$SessionExpirationListener)
>
> [2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down
> (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
>
> server.log:
>
> [2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
> (kafka.network.Processor)
>
> [2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27]
> on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27]
> from 1,2,3 to 1 (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,109] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for
> /brokers/topics/callbackServiceTopic-Medium/partitions/27/state
> (kafka.utils.ZkUtils$)
>
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27]
> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
> updating ISR (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on
> broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from
> 1,2,3 to 1 (kafka.cluster.Partition)
>
> [2014-05-08 19:01:30,111] ERROR Conditional update of path
> /brokers/topics/callbackServiceTopic-High/partitions/3/state with data
> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
> and expected version 9 failed due to
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state
> (kafka.utils.ZkUtils$)
>
> I'm also wondering why this happened at all... I do see some slightly high
> CMS activity in the GC log, but I don't see any Full GCs, and the whole thing
> was over within a minute. I'm concerned that something like this could make
> a broker fail this way!
>
> One of the after-effects we found from leaving the broker running in this
> state is that it was still allowing producers with cached metadata to
> produce to it, so we lost a whole bunch of messages before I shut it down.
> What's up with that?
>
>
> --Ian
>
>
> On Sun, May 11, 2014 at 11:39 PM, Jun Rao <junrao@gmail.com> wrote:
>
>> It seems that broker 1 had a soft failure (potentially due to GC).
>> However, somehow it didn't receive the latest LeaderAndIsrRequest from the
>> controller. Which version of Kafka are you using? In the controller log, do
>> you see broker 1 being added back as a live broker?
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ian@flurry.com> wrote:
>>
>>> This seems similar to behavior we’re seeing. At some point one of our
>>> brokers (id 1) just gives up and starts throwing those errors and
>>> kafka-topics no longer lists it as an ISR. However, the logs for that broker
>>> say something very odd:
>>>
>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8]
>>> on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
>>> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
>>> to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,251] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic/partitions/3/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
>>> (kafka.utils.ZkUtils$)
>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
>>> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31]
>>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31]
>>> from 1,2,3 to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,255] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>> (kafka.utils.ZkUtils$)
>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31]
>>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on
>>> broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
>>> 1,2,3 to 1 (kafka.cluster.Partition)
>>> [2014-05-09 10:16:00,258] ERROR Conditional update of path
>>> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
>>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>>> and expected version 9 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
>>> (kafka.utils.ZkUtils$)
>>>
>>> etc. And these errors continue every few seconds.
>>>
>>> kafka-topics.sh --describe output:
>>> Topic:callbackServiceTopic-High PartitionCount:50
>>> ReplicationFactor:3     Configs:
>>>         Topic: callbackServiceTopic-High        Partition: 0    Leader:
>>> 2       Replicas: 3,1,2 Isr: 2,3
>>>         Topic: callbackServiceTopic-High        Partition: 1    Leader:
>>> 2       Replicas: 1,2,3 Isr: 2,3
>>>         Topic: callbackServiceTopic-High        Partition: 2    Leader:
>>> 2       Replicas: 2,3,1 Isr: 2,3
>>>
>>> I went and looked at one of those ZNodes in the zkCLI and found this:
>>>
>>> [zk: localhost:2181(CONNECTED) 2] get
>>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>>>
>>> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>>>
>>> What does the version number there represent and how does it get out of
>>> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
>>> leader_epoch significant?
>>>
>>> Still trying to figure out Kafka operations :(
>>> --Ian
>>>
>>>
>>> On Apr 24, 2014, at 9:26 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
>>>
>>> > I didn't do any partition reassignment.
>>> >
>>> > This phenomenon happens when the broker hits the following error.
>>> >
>>> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
>>> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
>>> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
>>> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>> and expected version 7 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on
>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1]
>>> on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from
>>> 0,2 to 0 (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
>>> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
>>> and expected version 8 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1]
>>> on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on
>>> broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
>>> (kafka.cluster.Partition)
>>> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
>>> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
>>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>>> and expected version 7 failed due to
>>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>>> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
>>> (kafka.utils.ZkUtils$)
>>> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on
>>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>>> updating ISR (kafka.cluster.Partit
>>> > Best Regards
>>> > Jerry
>>> > -----Original Message-----
>>> > From: "Jun Rao" <junrao@gmail.com>
>>> > To: "users@kafka.apache.org" <users@kafka.apache.org>; "陈小军"
>>> <chenxiaojun@nhn.com>;
>>> > Cc:
>>> > Sent: 2014-04-25 (Friday) 02:12:02
>>> > Subject: Re: question about isr
>>> >
>>> > Interesting. Which version of Kafka are you using? Were you doing some
>>> partition reassignment? Thanks, Jun
>>> >
>>> >
>>> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com>
>>> wrote:
>>> >
>>> > Hi Team,
>>> >
>>> >   I found a strange phenomenon with the ISR list in my Kafka cluster.
>>> >
>>> >   When I use the tool that Kafka provides to get the topic information,
>>> it shows the ISR list as follows, which seems OK:
>>> >
>>> >
>>> >
>>> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe
>>> --zookeeper 10.96.250.215:10013,10.96.250.216:10013,
>>> 10.96.250.217:10013/nelo2-kafka
>>> >
>>> >
>>> >
>>> >
>>> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
>>>   Configs:
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3
>>> Replicas: 3,0   Isr: 0,3
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0
>>> Replicas: 0,1   Isr: 0,1
>>> >
>>> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1
>>> Replicas: 1,3   Isr: 1,3
>>> >
>>> >
>>> >
>>> >  But when I use an SDK to get the metadata from the broker, the ISR is
>>> different:
>>> >
>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 2,
>>> >   errorCode: 0, partition: 0, leader: 3, replicasNum: 2,
>>> >   replicas: [ 3, 0 ], isrNum: 2, isr: [ 0, 3 ] }
>>> >
>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 1,
>>> >   errorCode: 0, partition: 1, leader: 0, replicasNum: 2,
>>> >   replicas: [ 0, 1 ], isrNum: 2, isr: [ 0, 1 ] }
>>> >
>>> > metadata: { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
>>> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0,
>>> >   topicError: 0, topic: 'nelo2-normal-logs', partitionNum: 0,
>>> >   errorCode: 0, partition: 2, leader: 1, replicasNum: 2,
>>> >   replicas: [ 1, 3 ], isrNum: 1, isr: [ 1 ] }
>>> >
>>> >
>>> >
>>> > I also used other SDKs and got the same result. Checking the Kafka
>>> logs, it seems the SDK result is right and the tool gets the wrong result.
>>> Why does this happen?
>>> >
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
>>> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
>>> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>>> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
>>> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
>>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>>> >
>>> >
>>> >
>>> >
>>> > Thanks~!
>>> >
>>> >
>>> >
>>> > Best Regards
>>> >
>>> > Jerry
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>
>
