kafka-users mailing list archives

From Andrew Grasso <agra...@janestreet.com>
Subject Re: Data loss when ack != -1
Date Mon, 10 Oct 2016 13:49:09 GMT
Hi Justin,

Setting the required acks to -1 does not require that all assigned brokers
are available, only that all members of the ISR are available. If a broker
goes down, the producer is able to commit messages once the faulty broker
is evicted from the ISR. This can continue even if only one broker is
alive, in which case only that broker will be eligible to be leader. If
you'd like to ensure that all committed messages are present on at least N
machines, set min.insync.replicas to N and required acks to -1.
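
The rule above can be sketched as a toy model. This is only an illustration, not Kafka code: the `Partition` class, its fields, and its method names are hypothetical. It assumes that a write with required acks of -1 waits on the current ISR only, and is rejected when the ISR is smaller than min.insync.replicas.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy model of one partition's replica state (hypothetical, not Kafka code)."""
    replicas: set                 # every broker assigned to the partition
    isr: set                      # brokers currently in sync with the leader
    min_insync_replicas: int = 1  # stands in for min.insync.replicas

    def evict(self, broker):
        """Drop a failed broker from the ISR."""
        self.isr.discard(broker)

    def can_commit_with_acks_all(self):
        """With acks = -1 the write waits on ISR members only, not on all
        assigned replicas; it fails only if the ISR is below the minimum."""
        return len(self.isr) >= self.min_insync_replicas


# Replication factor 3, committed messages must exist on at least 2 machines.
p = Partition(replicas={"h1", "h2", "h3"}, isr={"h1", "h2", "h3"},
              min_insync_replicas=2)
p.evict("h3")
print(p.can_commit_with_acks_all())  # one broker down: ISR still has 2 members
p.evict("h2")
print(p.can_commit_with_acks_all())  # ISR has 1 member, below the minimum of 2
```

With `min_insync_replicas` left at 1, the same model keeps accepting writes down to a single live broker, which is the case described above.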

-Andrew

On Fri, Oct 7, 2016 at 5:05 PM, Justin Lin <linjianfengqrh@gmail.com> wrote:

> Hi everyone,
>
> I am currently running Kafka 0.8.1.1 in a cluster of 6 brokers, with the
> replication factor set to 3. My producer sets acks to 2 when producing
> messages. I recently ran into a bad situation: I had to reboot one broker
> machine by cutting its power, and that caused data loss.
>
> This is what actually happened.
>
> Producer 1 (PD1) sends a message (M100) to partition 88 (leader h1, ISR h1,
> h2, h3). Since acks == 2, M100 is considered committed and ready for
> consumers as soon as two brokers have acknowledged it. So h1 and h2 get
> M100, and consumer C1 pulls M100 down and handles it. So far so good; we
> are just waiting for h3 to catch up.
> But before that happens, h1 is shut down, and h3 never gets the chance to
> fetch M100 while it is still in the ISR. So partition 88 chooses a new
> leader from h2 and h3. Somehow (randomly) it chooses h3, so M100 on h2 is
> truncated and the data is lost.
> But this is not the worst part, because consumer C1 already got M100. After
> C1 handled the message, it committed its offset (100) back to a key-value
> store and started to pull message 101 from the new leader h3. Since h3
> doesn't have M100, it responded with the error "Offset out of bound".
> Now producer PD1 keeps producing messages to partition 88. Say it produces
> two messages (M1 and M2); their offsets on h3 are 100 and 101. When
> consumer C1 pulls from h3 at offset 101, it sees only one message, M2.
> Therefore M1 will never be processed by the consumer.
>
> This is extremely bad because the producer gets an acknowledgement but the
> consumer will never be able to process the message.
>
> I googled a bit on how to solve the problem. Most posts suggest changing
> acks to -1 (all). That is also prone to failure, since then if one broker
> is down, producers lose the ability to produce any data.
>
> I would like to ask the community for more wisdom on how to solve this
> problem. Any ideas or previous experience are welcome.
>
> Thanks in advance.
>
> --
> come on
>
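
The failure sequence in the quoted message can be replayed with a toy model. This is a sketch under stated assumptions, not broker code: logs are plain Python lists indexed by offset, `fetch`, `OffsetOutOfRange`, and `replay_scenario` are hypothetical names, and truncation is simplified to cutting a follower's log to the new leader's length.

```python
class OffsetOutOfRange(Exception):
    """Raised when a fetch asks for an offset past the end of the log."""


def fetch(log, offset):
    """Return all messages from `offset` onward (simplified fetch)."""
    if offset > len(log):
        raise OffsetOutOfRange(offset)
    return log[offset:]


def replay_scenario():
    base = [f"old-{i}" for i in range(100)]       # offsets 0..99 on every replica
    logs = {
        "h1": base + ["M100"],                    # leader, has M100 at offset 100
        "h2": base + ["M100"],                    # second ack, so acks == 2 is met
        "h3": list(base),                         # in the ISR but has not fetched M100
    }

    # Consumer C1 reads M100 from leader h1 and moves on to offset 101.
    consumed = fetch(logs["h1"], 100)
    next_offset = 101

    # h1 loses power; h3 is elected; h2 truncates to match the new leader.
    del logs["h1"]
    leader = "h3"
    logs["h2"] = logs["h2"][:len(logs[leader])]

    # C1 asks the new leader for offset 101, which is past the log end (100).
    try:
        fetch(logs[leader], next_offset)
        out_of_range = False
    except OffsetOutOfRange:
        out_of_range = True

    # PD1 produces M1 and M2; they land at offsets 100 and 101 on h3.
    logs[leader] += ["M1", "M2"]
    seen = fetch(logs[leader], next_offset)

    return {
        "consumed_before_failover": consumed,
        "out_of_range": out_of_range,
        "seen_after_failover": seen,
        "h2_log": logs["h2"],
    }


if __name__ == "__main__":
    for key, value in replay_scenario().items():
        if key != "h2_log":
            print(key, value)
```

In this model the consumer resumes at offset 101 and never sees M1 at offset 100, while M100 is gone from every surviving replica.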
