kafka-users mailing list archives

From Ian Friedman <...@flurry.com>
Subject Re: question about isr
Date Mon, 12 May 2014 19:55:25 GMT
Another question - If I start broker 1 back up now, with his replicas now out of sync, will
he automatically catch up with the leaders or do I have to do something to catch him up?

--Ian 
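
For what it's worth: assuming broker 1 comes back with the same broker.id and its log directories intact, its replica fetchers should catch up from the current leaders and the broker should be added back to each partition's ISR automatically; no manual step should be needed. One way to watch the catch-up, using the same tool quoted further down the thread (the ZooKeeper address below is a placeholder), is to re-run describe until broker 1 reappears under Isr:

    ./kafka-topics.sh --describe --zookeeper zkhost:2181 --topic callbackServiceTopic-High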

On May 12, 2014, at 10:49 AM, Jun Rao <junrao@gmail.com> wrote:

> Could you try 0.8.1.1? It fixed some bugs related to the controller hanging during shutdown.
> 
> For ZK session expiration, the #1 cause is GC. We have also seen transient issues with the underlying storage (e.g. a RAID controller reset) cause ZK session expiration.
> 
> As for the data loss in the producer, you are probably using ack=1, which could lead to data loss during leader failover. It's just that in this case, the failover window lasts forever due to some bugs.
> 
> Thanks,
> 
> Jun
> 
> 
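
Regarding the ack=1 point above: in the 0.8 producer the acknowledgement level is the request.required.acks setting. A minimal sketch, assuming the old Scala/Java producer and placeholder broker hosts (this is not the producer code from this thread), of requesting acks from all in-sync replicas, which narrows the normal failover loss window but does not by itself fix the stuck-ISR situation discussed here:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1 = wait for acknowledgement from all in-sync replicas,
            // instead of ack=1 (leader only), which Jun suspects above.
            props.put("request.required.acks", "-1");

            Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("callbackServiceTopic", "hello")); // topic name from this thread
            producer.close();
        }
    }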
> On Sun, May 11, 2014 at 10:14 PM, Ian Friedman <ian@flurry.com> wrote:
> Jun - We're using 0.8.1 
> The timestamps in the last few entries in controller.log seem to correspond to when the
trouble started in server.log
> controller.log:
> 
> [2014-05-08 19:01:27,693] INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
> 
> [2014-05-08 19:01:27,694] INFO [delete-topics-thread], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> 
> server.log:
> 
> [2014-05-08 19:01:27,382] INFO Closing socket connection to /10.10.13.3. (kafka.network.Processor)
> 
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2. (kafka.network.Processor)
> 
> [2014-05-08 19:01:27,385] INFO Closing socket connection to /10.10.57.2. (kafka.network.Processor)
> 
> [2014-05-08 19:01:29,886] INFO Partition [callbackServiceTopic-Medium,27] on broker 1: Shrinking ISR for partition [callbackServiceTopic-Medium,27] from 1,2,3 to 1 (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,109] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-Medium/partitions/27/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-Medium/partitions/27/state (kafka.utils.ZkUtils$)
> 
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-Medium,27] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,109] INFO Partition [callbackServiceTopic-High,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> 
> [2014-05-08 19:01:30,111] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-High/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/3/state (kafka.utils.ZkUtils$)
> 
> I'm also concerned about why this happened at all... I do see somewhat elevated CMS activity in the GC log, but no Full GCs, and the whole thing was over within a minute... It worries me that something like this could make a broker fail this way!
> 
> One of the after-effects of leaving the broker running in this state is that it kept accepting writes from producers with cached metadata, so we lost a whole bunch of messages before I shut it down. What's up with that??
> 
> 
> 
> --Ian
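
On the cached-metadata point above: the 0.8 producer refreshes topic metadata only periodically and after failed sends, so it can keep writing to a broker it still believes is the leader; in this case the stale broker was still acknowledging the writes, so no failure ever triggered a refresh. A sketch of the old-producer settings involved, with illustrative values rather than a recommendation from this thread:

    import java.util.Properties;
    import kafka.producer.ProducerConfig;

    public final class ProducerSettings {
        static ProducerConfig config() {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
            // Poll for fresh topic metadata more often than the 10-minute default
            // so leadership changes are noticed sooner.
            props.put("topic.metadata.refresh.interval.ms", "60000");
            // Retry failed sends with a backoff; failures also trigger a metadata refresh.
            props.put("message.send.max.retries", "5");
            props.put("retry.backoff.ms", "200");
            return new ProducerConfig(props);
        }
    }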
> 
> 
> 
> On Sun, May 11, 2014 at 11:39 PM, Jun Rao <junrao@gmail.com> wrote:
> It seems that broker 1 had a soft failure (potentially due to GC). However, somehow it
didn't receive the latest LeaderAndIsrRequest from the controller. Which version of Kafka
are you using? In the controller log, do you see broker 1 being added back as a live broker?
> 
> Thanks,
> 
> Jun
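
One direct way to answer the live-broker question, extending the zkCli session shown later in this thread (connect string and any chroot depend on your setup): broker registrations are ephemeral znodes under /brokers/ids, so if broker 1's ZK session expired and was never re-established, its node will be missing.

    [zk: localhost:2181(CONNECTED) 0] ls /brokers/ids
    [zk: localhost:2181(CONNECTED) 1] get /brokers/ids/1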
> 
> 
> On Fri, May 9, 2014 at 10:21 AM, Ian Friedman <ian@flurry.com> wrote:
> This seems similar to behavior we’re seeing. At some point one of our brokers (id 1) just gives up and starts throwing those errors, and kafka-topics no longer lists it in the ISR. However, the logs for that broker say something very odd:
> 
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic-High,8] on broker 1: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,248] INFO Partition [callbackServiceTopic,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] ERROR Conditional update of path /brokers/topics/callbackServiceTopic/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic/partitions/3/state (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic,3] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,251] INFO Partition [callbackServiceTopic-High,31] on broker 1: Shrinking ISR for partition [callbackServiceTopic-High,31] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-High/partitions/31/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-High/partitions/31/state (kafka.utils.ZkUtils$)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-High,31] on broker 1: Cached zkVersion [9] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-05-09 10:16:00,255] INFO Partition [callbackServiceTopic-Low,3] on broker 1: Shrinking ISR for partition [callbackServiceTopic-Low,3] from 1,2,3 to 1 (kafka.cluster.Partition)
> [2014-05-09 10:16:00,258] ERROR Conditional update of path /brokers/topics/callbackServiceTopic-Low/partitions/3/state with data {"controller_epoch":4,"leader":1,"version":1,"leader_epoch":4,"isr":[1]} and expected version 9 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/callbackServiceTopic-Low/partitions/3/state (kafka.utils.ZkUtils$)
> 
> etc. And these errors continue every few seconds.
> 
> kafka-topics.sh --describe output:
> Topic:callbackServiceTopic-High PartitionCount:50       ReplicationFactor:3     Configs:
>         Topic: callbackServiceTopic-High        Partition: 0    Leader: 2       Replicas: 3,1,2 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 1    Leader: 2       Replicas: 1,2,3 Isr: 2,3
>         Topic: callbackServiceTopic-High        Partition: 2    Leader: 2       Replicas: 2,3,1 Isr: 2,3
> 
> I went and looked at one of those ZNodes in the zkCLI and found this:
> 
> [zk: localhost:2181(CONNECTED) 2] get /brokers/topics/callbackServiceTopic-High/partitions/31/state
> {"controller_epoch":5,"leader":2,"version":1,"leader_epoch":5,"isr":[2,3]}
> 
> What does the version number there represent and how does it get out of sync? Should
I restart broker 1? Is the fact that broker 1 is behind in leader_epoch significant?
> 
> Still trying to figure out Kafka operations :(
> —Ian
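
On the version question above: that number is the ZooKeeper znode version (the zkVersion in the INFO lines). The broker writes the partition state with a conditional update that only succeeds if the znode's current version matches the version the broker has cached; if the controller has rewritten the znode in the meantime (which also advances controller_epoch and leader_epoch), the write fails with BadVersionException and the broker logs the "Cached zkVersion ... not equal" message. A minimal sketch of that primitive at the ZooKeeper client level, not Kafka's actual code path, with a placeholder payload:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalUpdate {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
            String path = "/brokers/topics/callbackServiceTopic-High/partitions/31/state";

            Stat stat = new Stat();
            zk.getData(path, false, stat);                                // stat.getVersion() is the cached zkVersion
            byte[] newState = "{\"placeholder\":true}".getBytes("UTF-8"); // illustrative payload only

            try {
                // Succeeds only if the znode version is still the one read above.
                zk.setData(path, newState, stat.getVersion());
            } catch (KeeperException.BadVersionException e) {
                // Someone else (e.g. the controller) changed the znode first, so the
                // cached version is stale -- the situation the broker logs report.
            }
            zk.close();
        }
    }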
> 
> 
> On Apr 24, 2014, at 9:26 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
> 
> > I didn't do any partition reassignment.
> >
> > When a broker hits the following error, this phenomenon happens.
> >
> > [hadoop@nelo76 libs]$ [2014-03-14 12:11:44,310] INFO Partition [nelo2-normal-logs,0] on broker 0: Shrinking ISR for partition [nelo2-normal-logs,0] from 0,1 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] ERROR Conditional update of path /brokers/topics/nelo2-normal-logs/partitions/0/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-normal-logs/partitions/0/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-normal-logs,0] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,313] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Shrinking ISR for partition [nelo2-symbolicated-logs,1] from 0,2 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,315] ERROR Conditional update of path /brokers/topics/nelo2-symbolicated-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":6,"isr":[0]} and expected version 8 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-symbolicated-logs/partitions/1/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,315] INFO Partition [nelo2-symbolicated-logs,1] on broker 0: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,316] INFO Partition [nelo2-crash-logs,1] on broker 0: Shrinking ISR for partition [nelo2-crash-logs,1] from 0,1 to 0 (kafka.cluster.Partition)
> > [2014-03-14 12:11:44,318] ERROR Conditional update of path /brokers/topics/nelo2-crash-logs/partitions/1/state with data {"controller_epoch":4,"leader":0,"version":1,"leader_epoch":5,"isr":[0]} and expected version 7 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/nelo2-crash-logs/partitions/1/state (kafka.utils.ZkUtils$)
> > [2014-03-14 12:11:44,318] INFO Partition [nelo2-crash-logs,1] on broker 0: Cached zkVersion [7] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > Best Regards
> > Jerry
> > -----Original Message-----
> > From: "Jun Rao"&lt;junrao@gmail.com&gt;
> > To: "users@kafka.apache.org"&lt;users@kafka.apache.org&gt;; "陈小军"&lt;chenxiaojun@nhn.com&gt;;
> > Cc:
> > Sent: 2014-04-25 (星期五) 02:12:02
> > Subject: Re: question about isr
> >
> > Interesting. Which version of Kafka are you using? Were you doing some partition
reassignment? Thanks, Jun
> >
> >
> > On Wed, Apr 23, 2014 at 11:14 PM, 陈小军 <chenxiaojun@nhn.com> wrote:
> >
> > Hi Team,
> >
> >   I found a strange phenomenon with the ISR list in my Kafka cluster.
> >
> >
> >
> >   When I use the tool that Kafka provides to get the topic information, it shows the ISR list as follows, which seems OK:
> >
> >
> >
> >     [irteam@xseed171.kdev bin]$ ./kafka-topics.sh --describe --zookeeper 10.96.250.215:10013,10.96.250.216:10013,10.96.250.217:10013/nelo2-kafka
> >
> >
> >
> >
> > Topic:nelo2-normal-logs PartitionCount:3        ReplicationFactor:2     Configs:
> >
> >        Topic: nelo2-normal-logs        Partition: 0    Leader: 3       Replicas: 3,0   Isr: 0,3
> >
> >        Topic: nelo2-normal-logs        Partition: 1    Leader: 0       Replicas: 0,1   Isr: 0,1
> >
> >        Topic: nelo2-normal-logs        Partition: 2    Leader: 1       Replicas: 1,3   Isr: 1,3
> >
> >
> >
> >  But when I use an SDK to get the metadata from a broker, the ISR is different.
> >
> > metadata:  { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0, topicError: 0,
> >   topic: 'nelo2-normal-logs', partitionNum: 2, errorCode: 0, partition: 0,
> >   leader: 3, replicasNum: 2, replicas: [ 3, 0 ], isrNum: 2, isr: [ 0, 3 ] }
> >
> > metadata:  { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0, topicError: 0,
> >   topic: 'nelo2-normal-logs', partitionNum: 1, errorCode: 0, partition: 1,
> >   leader: 0, replicasNum: 2, replicas: [ 0, 1 ], isrNum: 2, isr: [ 0, 1 ] }
> >
> > metadata:  { size: 246, correlationId: 0, brokerNum: -1, nodeId: 1,
> >   host: 'xseed171.kdev.nhnsystem.com', port: 9093, topicNum: 0, topicError: 0,
> >   topic: 'nelo2-normal-logs', partitionNum: 0, errorCode: 0, partition: 2,
> >   leader: 1, replicasNum: 2, replicas: [ 1, 3 ], isrNum: 1, isr: [ 1 ] }
> >
> >
> >
> > I also used another SDK and got the same result. I checked the logs from Kafka, and it seems the SDK result is right and the tool gets the wrong result. Why does this happen?
> >
> >
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:0,ISR:0,1,LeaderEpoch:7,ControllerEpoch:9),ReplicationFactor:2),AllReplicas:0,1) for partition [nelo2-normal-logs,1] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:9,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:1,3) for partition [nelo2-normal-logs,2] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> > [2014-04-24 14:53:57,705] TRACE Broker 3 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:0,3,LeaderEpoch:10,ControllerEpoch:10),ReplicationFactor:2),AllReplicas:3,0) for partition [nelo2-normal-logs,0] in response to UpdateMetadata request sent by controller 0 epoch 10 with correlation id 13 (state.change.logger)
> >
> >
> >
> >
> > Thanks~!
> >
> >
> >
> > Best Regards
> >
> > Jerry
> >
> >
> >
> >
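
Jerry's comparison shows kafka-topics.sh, which reads the partition state straight from ZooKeeper, disagreeing with the metadata a broker serves from its own cache (populated by the controller's UpdateMetadata requests, as the TRACE lines show). A sketch of issuing the same metadata request from Java with the 0.8 SimpleConsumer, pointed at the host and port from the dump above, to see what that broker has cached (the client id is arbitrary):

    import java.util.Collections;
    import kafka.javaapi.PartitionMetadata;
    import kafka.javaapi.TopicMetadata;
    import kafka.javaapi.TopicMetadataRequest;
    import kafka.javaapi.TopicMetadataResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class BrokerMetadataCheck {
        public static void main(String[] args) {
            // Ask one specific broker what it has cached (host/port from the dump above).
            SimpleConsumer consumer = new SimpleConsumer(
                    "xseed171.kdev.nhnsystem.com", 9093, 100000, 64 * 1024, "isr-check");
            try {
                TopicMetadataRequest request =
                        new TopicMetadataRequest(Collections.singletonList("nelo2-normal-logs"));
                TopicMetadataResponse response = consumer.send(request);
                for (TopicMetadata topic : response.topicsMetadata()) {
                    for (PartitionMetadata partition : topic.partitionsMetadata()) {
                        System.out.println("partition " + partition.partitionId()
                                + "  leader " + (partition.leader() == null ? "none" : partition.leader().id())
                                + "  isr " + partition.isr());
                    }
                }
            } finally {
                consumer.close();
            }
        }
    }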
> 
> 
> 
> 

