kafka-users mailing list archives

From "Philip O'Toole" <phi...@loggly.com>
Subject Re: Unpacking "strong durability and fault-tolerance guarantees"
Date Sun, 07 Jul 2013 19:50:11 GMT
Disclaimer: I only have experience right now with 0.7.2

On Sun, Jul 7, 2013 at 11:35 AM, David James <davidcjames@gmail.com> wrote:
> Sorry for the long email, but I've tried to keep it organized, at least.
> "Kafka has a modern cluster-centric design that offers strong
> durability and fault-tolerance guarantees." and "Messages are
> persisted on disk and replicated within the cluster to prevent data
> loss." according to http://kafka.apache.org/.
> I'm trying to understand what this means in some detail. So, two questions.
> 1. Fault-Tolerance
> If a Broker in a Kafka cluster fails (the EC2 instance dies), what
> happens? After, let's say I add a new Broker to the cluster (that's my
> responsibility, not Kafka's). What happens when it rejoins?
> To be more particular, if the cluster consists of a Zookeeper and B
> (3, for example) Brokers, can a Kafka system guarantee to tolerate up
> to B-1 (2, for example) Broker failures?
> 2. Durability at an application level
> What are the guarantees about durability, at an application level, in
> practice? By "application level" I mean guarantees that a produced
> message gets consumed and acted upon by an application that uses
> Kafka. My understanding at present is that Kafka does not make these
> kinds of guarantees because there are no acks. So, it is up to the
> application developer to handle it. Is this right?


> Here's my understanding: Having messages persisted on disk and
> replicated is why Kafka has durability guarantees. But, from an
> application perspective, what happens when a consumer pulls a message
> but fails before acting on it? That would update the Kafka consumer
> offset, right? So, without some thinking and planning ahead on the
> Kafka system design, the application's consumers would not have a way
> of knowing that a message was not actually processed.

Don't update the offset until the message has been fully processed. This
means you need to build downstream systems that can accept messages
being replayed, since a message may be processed but the consumer may
crash before the offset is updated. Or at least have a process in
place to deal with clean-up in the event of a crash.
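
To make the ordering concrete, here is a minimal sketch in Python. It uses
an in-memory list as a stand-in for a partition and does not use any real
Kafka client API. The names (IdempotentSink, run_consumer, the offsets
dict) are hypothetical and exist only for illustration. The point is that
the offset is advanced only after processing, and the downstream side
treats a replayed offset as a no-op.

```python
class Crash(Exception):
    """Simulates the consumer dying after processing but before the offset update."""


class IdempotentSink:
    """Hypothetical downstream system that tolerates replayed messages."""

    def __init__(self):
        self.applied = {}   # offset -> message already acted upon
        self.replays = 0    # how many duplicate deliveries were ignored

    def apply(self, offset, message):
        # Acting on the same offset twice is a no-op, so replays are harmless.
        if offset in self.applied:
            self.replays += 1
            return
        self.applied[offset] = message


def run_consumer(log, offsets, sink, crash_after_processing=None):
    """Process messages starting at the stored offset.

    `log` is a plain list standing in for a partition; `offsets` is a dict
    standing in for wherever the consumer offset is persisted (assumption).
    """
    offset = offsets["committed"]
    while offset < len(log):
        sink.apply(offset, log[offset])       # 1. fully process the message
        if offset == crash_after_processing:  # simulated failure between the steps
            raise Crash("died after processing offset %d" % offset)
        offset += 1
        offsets["committed"] = offset         # 2. only now update the offset


log = ["a", "b", "c"]
offsets = {"committed": 0}
sink = IdempotentSink()

try:
    run_consumer(log, offsets, sink, crash_after_processing=1)
except Crash:
    pass  # offset 1 was processed, but the stored offset still points at it

run_consumer(log, offsets, sink)  # restart: offset 1 is delivered again
```

After the restart, the message at offset 1 is delivered a second time, and
the sink ignores it. If the sink were not idempotent, this is the spot
where you would need the clean-up process instead.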

> Conclusion / Last Question
> I'm interested in making the chance of message loss minimal, at a
> system level. Any pointers on what to read or think about would be
> appreciated!
> Thanks!
> -David
