kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gleb Zhukov <gzhu...@iponweb.net>
Subject Re: Kafka loses data after one broker reboot
Date Wed, 02 Sep 2015 14:16:18 GMT
Hi, Lukas.
I don't think so because we lost data with offsets which are larger than
exist offsets. But retention should delete offsets less than exist offsets.

Hi, Gwen.
Thanks a lot. We will sleep on it.


On Wed, Sep 2, 2015 at 1:07 AM, Gwen Shapira <gwen@confluent.io> wrote:

> There is another edge-case that can lead to this scenario. It is described
> in detail in KAFKA-2134, and I'll copy Becket's excellent summary here for
> reference:
>
> 1) Broker A (leader) has committed offset up-to 5000
> 2) Broker B (follower) has committed offset up to 3000 (he is still in ISR
> because of replica.lag.max.messages)
> ***network glitch happens***
> 3) Broker B becomes a leader, Broker A becomes a follower
> 4) Broker A fetch from Broker B and got an OffsetOutOfRangeException (since
> it is trying to fetch offset over 5000 while B only has 3000).
> 4.1) Broker B got some new message and its log end offset become 6000.
> 4.2) Broker A tries to handle the Exception so it checks the log end offset
> on broker B (Followers that get OffsetOutOfRangeException try to reset to
> the LogEndOffset of the leader) and found it is greater than broker A's log
> end offset so it truncate itself to the starting offset of Broker B
> (causing significant replication to happen).
>
> You can avoid data loss in these scenarios by setting acks = -1 on all
> producers and unclean.leader.election = false.
> This guarantees that messages that got "acked" to producers exist on all
> in-sync replicas, and that an unsynced replica will never become a leader
> of a partition. So when a scenario like above occurs, it means that
> messages 3000-5000 (which were lost) were never acked to producers (and
> hopefully the producers will retry sending these messages).
>
> Gwen
>
> On Tue, Sep 1, 2015 at 9:56 AM, Lukas Steiblys <lukas@doubledutch.me>
> wrote:
>
> > Hi Gleb,
> >
> > Are you sure you were not trying to read data that had already exceeded
> > its retention policy?
> >
> > Lukas
> >
> > -----Original Message----- From: Gleb Zhukov
> > Sent: Tuesday, September 1, 2015 6:26 AM
> > To: users@kafka.apache.org
> > Subject: Kafka loses data after one broker reboot
> >
> >
> > Hi, All!
> > We use 3 kafka's brokers with replica-factor 2. Today we were doing
> > partitions reassignment and one of our brokers was rebooted due some
> > hardware problem. After broker had returned back to work we found that
> our
> > consumer doesn't work with errors like:
> >
> > ERROR java.lang.AssertionError: assumption failed: 765994 exceeds 6339
> > ERROR java.lang.AssertionError: assumption failed: 1501252 exceeds 416522
> > ERROR java.lang.AssertionError: assumption failed: 950819 exceeds 805377
> >
> > Some logs from broker:
> >
> > [2015-09-01 13:00:16,976] ERROR [Replica Manager on Broker 61]: Error
> when
> > processing fetch request for partition [avro_match,27] offset 208064729
> > from consumer with correlation id 0. Possible cause: Request for offset
> > 208064729 but we only have log segments in the range 209248794 to
> > 250879159. (kafka.server.ReplicaManager)
> > [2015-09-01 13:01:17,943] ERROR [Replica Manager on Broker 45]: Error
> when
> > processing fetch request for partition [logs.conv_expired,20] offset 454
> > from consumer with correlation id 0. Possible cause: Request for offset
> 454
> > but we only have log segments in the range 1349769 to 1476231.
> > (kafka.server.ReplicaManager)
> >
> > [2015-09-01 13:21:23,896] INFO Partition [logs.avro_event,29] on broker
> 61:
> > Expanding ISR for partition [logs.avro_event,29] from 61,77 to 61,77,45
> > (kafka.cluster.Partition)
> > [2015-09-01 13:21:23,899] INFO Partition [logs.imp_tstvssamza,6] on
> broker
> > 61: Expanding ISR for partition [logs.imp_tstvssamza,6] from 61,77 to
> > 61,77,45 (kafka.cluster.Partition)
> > [2015-09-01 13:21:23,902] INFO Partition [__consumer_offsets,30] on
> broker
> > 61: Expanding ISR for partition [__consumer_offsets,30] from 61,77 to
> > 61,77,45 (kafka.cluster.Partition)
> > [2015-09-01 13:21:23,905] INFO Partition [logs.test_imp,44] on broker 61:
> > Expanding ISR for partition [logs.test_imp,44] from 61 to 61,45
> > (kafka.cluster.Partition)
> >
> > Looks like we lost part of our data.
> >
> > Also kafka started to replicating random partitions (bad broker was
> already
> > up and running, log recovery was completed):
> >
> > root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
> > zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc
> > -l
> > Tue Sep  1 13:02:24 UTC 2015
> > 431
> > root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
> > zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc
> > -l
> > Tue Sep  1 13:02:37 UTC 2015
> > 386
> > root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
> > zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc
> > -l
> > Tue Sep  1 13:02:48 UTC 2015
> > 501
> > root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
> > zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc
> > -l
> > Tue Sep  1 13:02:58 UTC 2015
> > 288
> > root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
> > zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc
> > -l
> > Tue Sep  1 13:03:08 UTC 2015
> > 363
> >
> > Could anyone throw some light on this situation?
> >
> > We use ext4 on our brokers and these settings:
> >
> > port=9092
> > num.network.threads=2
> > num.io.threads=8
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > log.dirs=/mnt/kafka/kafka-data
> > num.partitions=1
> > default.replication.factor=2
> > message.max.bytes=10000000
> > replica.fetch.max.bytes=10000000
> > auto.create.topics.enable=false
> > log.roll.hours=24
> > num.replica.fetchers=4
> > auto.leader.rebalance.enable=true
> > log.retention.hours=168
> > log.segment.bytes=134217728
> > log.retention.check.interval.ms=60000
> > log.cleaner.enable=false
> > delete.topic.enable=true
> >
> >
> zookeeper.connect=zk1d.gce-eu.kafka:2181,zk2d.gce-eu.kafka:2181,zk3d.gce-eu.kafka:2181/kafka
> > zookeeper.connection.timeout.ms=6000
> >
> > Shall I do something with these parameters or cluster with 3 brokers and
> > with replica-factor=2 should prevent such issues?
> >
> > log.flush.interval.ms
> > log.flush.interval.messageslog.flush.scheduler.interval.ms
> >
> > THX!
> >
> >
> > --
> > Best regards,
> > Gleb Zhukov
> >
>



-- 
Best regards,
Gleb Zhukov

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message