kafka-users mailing list archives

From Henning Røigaard-Petersen <...@edlund.dk>
Subject Official Kafka Disaster Recovery is insufficient - Suggestions needed
Date Mon, 03 Sep 2018 14:49:12 GMT
I am looking for advice on how to handle disasters not covered by the official methods of replication,
whether intra-cluster replication (via replication factor and producer acks) or multi-cluster
replication (using Confluent Replicator).
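For context, the intra-cluster guarantees referred to above are typically expressed through settings along these lines; the values below are illustrative, not a recommendation:

```properties
# Broker side (illustrative): three copies of each partition,
# and a write is only committed once at least two of them have it.
default.replication.factor=3
min.insync.replicas=2

# Producer side: wait for acknowledgement from all in-sync replicas
# before considering a send successful.
acks=all
enable.idempotence=true
```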

We are looking into using Kafka not only as a broker for decoupling, but also as an event store.
For us, data is mission-critical and has unlimited retention, with the option to compact (required
by GDPR).

We are especially interested in two types of disasters:

1.       Unintentional or malicious use of the Kafka API to create or compact messages, as
well as deletion of topics or logs.

2.       Loss of tail data from partitions if all ISRs fail in the primary cluster. A special
case of this is the loss of the tail of the offsets commit topic, which results in loss of consumer
offsets.
Ad (1)
Replication does not guard against disaster (1), as the resulting messages and deletions spread
to the replicas.
A naive solution is to simply secure the cluster and ensure that there are no errors in the
code (...), but anyone with an ounce of experience with software knows that things go wrong.

Ad (2)
The potential of disaster (2) is much worse. In case of total data center loss, the secondary
cluster will lose the tail of every partition that was not fully replicated. Besides the data loss
itself, there are now inconsistencies between related topics and partitions, which breaks the
state of the system.
Granted, the likelihood of total center loss is not great, but there is a reason we have multi-center
setups in the first place.
The special case of lost consumer offsets results in double processing of events once
we resume processing from an older offset. While idempotency is a solution, it might not always
be possible or desirable.
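To make the idempotency work-around concrete: it usually amounts to the consumer deduplicating deliveries itself, e.g. by remembering which (partition, offset) pairs it has already applied. A minimal sketch follows; the class and method names are ours for illustration, not a Kafka API:

```python
# Hypothetical at-least-once deduplication sketch: after an offset
# rollback, replayed records are detected and skipped, so the
# application state is not double-updated.

class IdempotentApplier:
    def __init__(self):
        self.applied = set()   # (partition, offset) pairs already processed
        self.state = 0         # toy application state: a running sum

    def apply(self, partition, offset, value):
        key = (partition, offset)
        if key in self.applied:
            return False       # duplicate delivery after offset loss: skip
        self.state += value
        self.applied.add(key)
        return True
```

The catch, of course, is that the applied-set must be persisted atomically together with the state, or it is lost in exactly the same disasters; that is what makes idempotency non-trivial for us.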

Once any of these types of disaster has occurred, there is no way to restore the lost data,
and even worse, we cannot restore the cluster to a point in time where it is consistent. We
could probably look our customers in the eyes and tell them that they lost a day's worth of
progress, but we cannot inform them of a state of perpetual inconsistency.

The only solution we can think of right now is to shut the primary cluster down (ensuring
that no new events are produced) and then copy all files to some secure location, i.e. effectively
creating a backup that allows restoration to a consistent point in time. Though the tail (backup-wise)
will be lost in case of disaster, we are at least assured a consistent state to restore to.
As a useful side effect, such backup-files can also be used to create environments for test
or other destructive purposes.
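The cold-backup procedure above could be sketched roughly as follows; the paths and snapshot layout are illustrative, and it assumes every broker has been cleanly shut down first so the log segments on disk form a consistent point in time:

```python
import shutil
import time
from pathlib import Path

def snapshot_log_dirs(log_dirs, backup_root):
    """Copy all Kafka log directories into a timestamped backup folder.

    Illustrative sketch: assumes the whole cluster is stopped, so the
    segment files are immutable for the duration of the copy.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / stamp
    for d in log_dirs:
        src = Path(d)
        # one subfolder per source log dir, e.g. .../20180903-120000/kafka-logs-0
        shutil.copytree(src, dest / src.name)
    return dest
```

Restoring is then the reverse copy onto the brokers of a fresh (or wiped) cluster before starting them, which is what gives us the consistent restore point.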

Does anyone have a better idea?

Venlig hilsen / Best regards

Henning Røigaard-Petersen
Principal Developer
MSc in Mathematics

hrp@edlund.dk
Direct phone +45 36150272


Edlund A/S
La Cours Vej 7
DK-2000 Frederiksberg
Tel +45 36150630
