kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Friedman <...@flurry.com>
Subject Re: question about isr
Date Mon, 12 May 2014 05:14:19 GMT
Jun - We're using 0.8.1
The timestamps in the last few entries in controller.log seem to correspond
to when the trouble started in server.log

controller.log:

[2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK
expired; shut down all controller components and try to re-elect
(kafka.controller.KafkaController$SessionExpirationListener)

[2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down
(kafka.controller.TopicDeletionManager$DeleteTopicsThread)

server.log:

[2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3.
(kafka.network.Processor)

[2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
(kafka.network.Processor)

[2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2.
(kafka.network.Processor)

[2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27]
on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27]
from 1,2,3 to 1 (kafka.cluster.Partition)

[2014-05-08 19:01:30,109] ERROR Conditional update of path
/brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data
{"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
and expected version 9 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for
/brokers/topics/callbackServiceTopic-Medium/partitions/27/state
(kafka.utils.ZkUtils$)

[2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27]
on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
updating ISR (kafka.cluster.Partition)

[2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on
broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from
1,2,3 to 1 (kafka.cluster.Partition)

[2014-05-08 19:01:30,111] ERROR Conditional update of path
/brokers/topics/callbackServiceTopic-High/partitions/3/state with data
{"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
and expected version 9 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state
(kafka.utils.ZkUtils$)

I'm also concerned why this happened at all... I do see some slightly high
CMS activity in the GC log but I don't see any Full GCs and the whole thing
was over within a minute... I'm a bit concerned how something like this
could make a broker fail this way!

One of the after effects we found from leaving the broker running in this
state is that it was still allowing producers with cached metadata to
produce to it, so we lost a whole bunch of messages before I shut him down.
What's up with that??


--Ian


On Sun, May 11, 2014 at 11:39 PM, Jun Rao <junrao@gmail.com> wrote:

> It seems that broker 1 had a soft failure (potentially due to GC).
> However, somehow it didn't receive the latest LeaderAndIsrRequest from the
> controller. Which version of Kafka are you using? In the controller log, do
> you see broker 1 being added back as a live broker?
>
> Thanks,
>
> Jun
>
>
> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ian@flurry.com> wrote:
>
>> This seems similar to behavior we’re seeing. At some point one of our
>> brokers (id 1) just gives up and starts throwing those errors and
>> kafka-topics no longer lists it as a ISR. However the logs for that broker
>> say something very odd:
>>
>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on
>> broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on
>> broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3
>> to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,251] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic/partitions/3/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state
>> (kafka.utils.ZkUtils$)
>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on
>> broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31]
>> on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31]
>> from 1,2,3 to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,255] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>> (kafka.utils.ZkUtils$)
>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31]
>> on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on
>> broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from
>> 1,2,3 to 1 (kafka.cluster.Partition)
>> [2014-05-09 10:16:00,258] ERROR Conditional update of path
>> /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data
>> {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]}
>> and expected version 9 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state
>> (kafka.utils.ZkUtils$)
>>
>> etc. And these errors continue every few seconds.
>>
>> kafka-topics.sh —describe output:
>> Topic:callbackServiceTopic-High PartitionCount:50
>> ReplicationFactor:3     Configs:
>>         Topic: callbackServiceTopic-High        Partition: 0    Leader: 2
>>       Replicas: 3,1,2 Isr: 2,3
>>         Topic: callbackServiceTopic-High        Partition: 1    Leader: 2
>>       Replicas: 1,2,3 Isr: 2,3
>>         Topic: callbackServiceTopic-High        Partition: 2    Leader: 2
>>       Replicas: 2,3,1 Isr: 2,3
>>
>> I went and looked at one of those ZNodes in the zkCLI and found this:
>>
>> [zk: localhost:2181(CONNECTED) 2] get
>> /brokers/topics/callbackServiceTopic-High/partitions/31/state
>> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
>>
>> What does the version number there represent and how does it get out of
>> sync? Should I restart broker 1? Is the fact that broker 1 is behind in
>> leader_epoch significant?
>>
>> Still trying to figure out Kafka operations :(
>> —Ian
>>
>>
>> On Apr 24, 2014, at 9:26 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
>>
>> > I don't do any partition reassignment.
>> >
>> > When broker occure following error, this phenomenon  will happen.
>> >
>> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition
>> [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition
>> [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,313] ERROR Conditional update of path
>> /brokers/topics/nelo2-normal-logs/partitions/0/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>> and expected version 7 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on
>> broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on
>> broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2
>> to 0 (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,315] ERROR Conditional update of path
>> /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]}
>> and expected version 8 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on
>> broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip
>> updating ISR (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker
>> 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0
>> (kafka.cluster.Partition)
>> > [2014-03-14 12:11:44,318] ERROR Conditional update of path
>> /brokers/topics/nelo2-crash-logs/partitions/1/state with data
>> {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]}
>> and expected version 7 failed due to
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state
>> (kafka.utils.ZkUtils$)
>> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker
>> 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partit
>> > Best Regards
>> > Jerry
>> > -----Original Message-----
>> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
>> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&
>> lt;chenxiaojun@nhn.com&gt;;
>> > Cc:
>> > Sent: 2014-04-25 (星期五) 02:12:02
>> > Subject: Re: question about isr
>> >
>> > Interesting. Which version of Kafka are you using? Were you doing some
>> partition reassignment? Thanks, Jun
>> >
>> >
>> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 &lt;chenxiaojun@nhn.com&gt;
>> wrote:
>> >
>> > Hi Team,
>> >
>> >   I found a strange phenomenon of isr list in my kafka cluster
>> >
>> >
>> >
>> >   When I use the tool that kafka provide to get the topic information,
>> and it show isr list as following, seem it is ok
>> >
>> >
>> >
>> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe
>> --zookeeper 10.96.250.215:10013,10.96.250.216:10013,
>> 10.96.250.217:10013/nelo2-kafka
>> >
>> >
>> >
>> >
>> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2
>> Configs:
>> >
>> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3
>> Replicas: 3,0   Isr: 0,3
>> >
>> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0
>> Replicas: 0,1   Isr: 0,1
>> >
>> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1
>> Replicas: 1,3   Isr: 1,3
>> >
>> >
>> >
>> >  but when I use some sdk to get the meta info from broker, the isr is
>> different.
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 2,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 0,
>> >
>> >  leader: 3,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 3, 0 ],
>> >
>> >  isrNum: 2,
>> >
>> >  isr: [ 0, 3 ] }
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 1,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 1,
>> >
>> >  leader: 0,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 0, 1 ],
>> >
>> >  isrNum: 2,
>> >
>> >  isr: [ 0, 1 ] }
>> >
>> > metadata:  { size: 246,
>> >
>> >  correlationId: 0,
>> >
>> >  brokerNum: -1,
>> >
>> >  nodeId: 1,
>> >
>> >  host: 'xseed171.kdev.nhnsystem.com',
>> >
>> >  port: 9093,
>> >
>> >  topicNum: 0,
>> >
>> >  topicError: 0,
>> >
>> >  topic: 'nelo2-normal-logs',
>> >
>> >  partitionNum: 0,
>> >
>> >  errorCode: 0,
>> >
>> >  partition: 2,
>> >
>> >  leader: 1,
>> >
>> >  replicasNum: 2,
>> >
>> >  replicas: [ 1, 3 ],
>> >
>> >  isrNum: 1,
>> >
>> >  isr: [ 1 ] }
>> >
>> >
>> >
>> > I also use other sdk, get the same result. I check the logs from kafka,
>> it seems the sdk result is right. the tool get the wrong result. why is it
>> happend?
>> >
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1)
>> for partition [nelo2-normal-logs,1] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3)
>> for partition [nelo2-normal-logs,2] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info
>> (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0)
>> for partition [nelo2-normal-logs,0] in response to UpdateMetadata request
>> sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
>> >
>> >
>> >
>> >
>> > Thanks~!
>> >
>> >
>> >
>> > Best Regards
>> >
>> > Jerry
>> >
>> >
>> >
>> >
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message