kafka-users mailing list archives

From Henning Røigaard-Petersen <...@edlund.dk>
Subject RE: Official Kafka Disaster Recovery is insufficient - Suggestions needed
Date Tue, 04 Sep 2018 14:10:17 GMT
Thank you for your answer, Ryanne. It’s always a pleasure to be presented with such unambiguous
advice. It really gives you something to work with :).

To any other readers, I am very interested in hearing of other approaches to DR in Kafka.


Ryanne, I agree with your statement about the relative probability of the different DR scenarios,
and I get in principle how your approach would allow us to recover from “bad” messages,
but we must of course ensure that we have countermeasures for all the scenarios.

To that end, I have a couple of questions about your approach to DR.

Q1) 
It is my understanding that you produce messages to Kafka partitions using the normal producer
API and then ETL them to some cold storage using one or more consumers, i.e. the cold storage
is eventually consistent with Kafka!?

If this is true, isn’t your approach prone to the same loss-of-tail issues as regular
multi-cluster replication in the case of total ISR loss? That is, we may end up with
inconsistent cold storage, because downstream messages may be backed up before the
corresponding upstream messages are backed up?

I guess some ways around this would be to have only one partition (not feasible), or to store
state changes directly in other storage and ETL those changes back to Kafka for downstream
consumption. However, I believe that is not what you are suggesting.
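For reference, the ETL consumer I have in mind is roughly the sketch below (only a sketch; the
names, the topic, and the use of a local file in place of HDFS are my own assumptions):

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.io.FileWriter;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of the ETL consumer: it tails a topic and appends every
// record to cold storage (a local file stands in for HDFS here).
public class ColdStorageEtl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cold-storage-etl");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FileWriter coldStore = new FileWriter("events-archive.jsonl", true)) {
            consumer.subscribe(List.of("events"));              // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Persist key, timestamp and value so the record can be re-ingested later.
                    coldStore.write(r.key() + "\t" + r.timestamp() + "\t" + r.value() + "\n");
                }
                coldStore.flush();
                consumer.commitSync();                          // commit only after the archive write
            }
        }
    }
}

The archive is written by an ordinary consumer and therefore always trails the partitions it
reads from, which is exactly the eventual consistency I am asking about.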

Q2) 
I am unsure how your approach would work in practice in the likely disaster scenario of bad
messages.

Assume a bad message is produced and ETL’ed to the cold storage. 

As an isolated message, we could simply wipe the Kafka partition and reproduce all relevant
messages, or compact the bad message away with a newer version. This all makes sense.
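For this isolated case, the fix I picture is simply republishing a corrected value under the
original key, along the lines of the sketch below (the topic, key, and payload are invented for
illustration; this is my reading of your approach, not something you spelled out):

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

// Hypothetical sketch: re-submit a corrected record under the original key so that
// log compaction and key-aware processors treat it as a replacement, not a new entity.
public class RepublishCorrectedRecord {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        String key = "customer-42";                     // key of the bad record, taken from cold storage
        String correctedValue = "{\"name\":\"...\"}";   // fixed payload (placeholder)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same topic, same key: compaction keeps only the newest value per key,
            // and downstream processors that honor keys/timestamps see a replacement.
            producer.send(new ProducerRecord<>("events", key, correctedValue)).get();
        }
    }
}

With compaction enabled on the topic, the old value is eventually discarded, and key-aware
consumers treat the correction as a replacement for the bad record.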
However, more likely it will not be an isolated bad message; rather, a plethora of downstream
consumers will process it and in turn produce derived bad messages, which are processed further
downstream. This could result in an enormous number of bad messages and bad state in cold
storage.

How would you recover in this case? 

It might be possible to iterate through the entirety of the state to detect bad messages,
but updating it with the correct data seems impossible.

I guess one very crude fallback solution may be to identify the root bad message and somehow
restore the entire system to a previous consistent state. This, however, requires some global
message property across the entire system. You mention timestamps, but traditionally these are
intrinsically unreliable, especially in a distributed environment, and would most likely lead
to loss of messages with timestamps close to that of the root bad message.

Q3) 
Does the statement “Don't rely on unlimited retention in Kafka” imply some flaw in the
implementation, or is it simply a restatement of the advice not to use Kafka as the source of
truth, due to the DR issues?

Thank you for your time

Henning Røigaard-Petersen

-----Original Message-----
From: Ryanne Dolan <ryannedolan@gmail.com> 
Sent: 3 September 2018 20:27
To: users@kafka.apache.org
Subject: Re: Official Kafka Disaster Recovery is insufficient - Suggestions needed

Sorry to have misspelled your name Henning.

On Mon, Sep 3, 2018, 1:26 PM Ryanne Dolan <ryannedolan@gmail.com> wrote:

> Hanning,
>
> In mission-critical (and indeed GDPR-related) applications, I've ETL'd 
> Kafka to a secondary store e.g. HDFS, and built tooling around 
> recovering state back into Kafka. I've had situations where data is 
> accidentally or incorrectly ingested into Kafka, causing downstream 
> systems to process bad data. This, in my experience, is astronomically 
> more likely than the other DR scenarios you describe. But my approach is the same:
>
> - don't treat Kafka as a source-of-truth. It is hard to fix data in an 
> append-only log, so we can't trust it to always be correct.
>
> - ETL Kafka topics to a read-only, append-only, indexable log e.g. in 
> HDFS, and then build tooling to reingest data from HDFS back into Kafka.
> That way in the event of disaster, data can be recovered from cold storage.
> Don't rely on unlimited retention in Kafka.
>
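To check that I understand the reingest tooling: I picture it roughly like the sketch below (it
assumes the tab-separated archive format from my earlier sketch, and the topic name is made up):

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

// Hypothetical sketch of "reingest from cold storage": read archived records
// (key \t timestamp \t value per line) and produce them back to Kafka with
// their original keys and timestamps.
public class ReingestFromArchive {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader archive = new BufferedReader(new FileReader("events-archive.jsonl"))) {
            String line;
            while ((line = archive.readLine()) != null) {
                String[] fields = line.split("\t", 3);
                // Keep the original key and timestamp so downstream processors can treat
                // the reingested record idempotently, regardless of its new partition offset.
                producer.send(new ProducerRecord<>("events", null,
                        Long.parseLong(fields[1]), fields[0], fields[2]));
            }
            producer.flush();
        }
    }
}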
> - build everything around log compaction, keys, and idempotency. 
> People think compaction is just to save space, but it is also the only 
> way to layer new records over existing records in an otherwise 
> append-only environment. I've built pipelines that let me surgically 
> remove or fix records at rest and then submit them back to Kafka. 
> Since these records have the same key, processors will treat them as 
> replacements to earlier records. Additionally, processors should honor 
> timestamps and/or sequence IDs and not the actual order of records in 
> a partition. That way old data can be ingested from HDFS -> Kafka idempotently.
>
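As I read this point, a processor that honors timestamps rather than partition order could look
roughly like the sketch below (the class and its in-memory state are my own invention, purely
for illustration):

import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a processor that honors record timestamps instead of
// partition order: a late-arriving correction with the same key replaces the old
// value, while a genuinely older record is ignored.
public class LatestByKeyProcessor {
    private final Map<String, Long> lastTimestampByKey = new HashMap<>();
    private final Map<String, String> stateByKey = new HashMap<>();

    public void process(ConsumerRecord<String, String> record) {
        Long seen = lastTimestampByKey.get(record.key());
        if (seen == null || record.timestamp() >= seen) {
            // Newer (or equal) event time wins, regardless of where it sits in the partition.
            lastTimestampByKey.put(record.key(), record.timestamp());
            stateByKey.put(record.key(), record.value());
        }
        // Older event time: drop it; state already reflects a newer version of this key.
    }
}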
> Imagine that one record out of millions is bad and you don't notice it 
> for months. You can grab this record from HDFS, modify it, and then 
> submit it back to Kafka. Even though it is in the stream months later 
> than real-time, processors will treat it as a replacement for the old 
> bad record and the entire system will end up exactly as if the record 
> was never bad. If you can achieve these semantics consistently, DR is straightforward.
>
> - don't worry too much about offsets wrt consumer progress. Timestamps 
> are usually more important. If the above is in place, it doesn't 
> matter if you skip some records during failover. Just reprocess a few 
> hours from cold storage and it's like the failure never happened.
>
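If I understand the failover step correctly, the rewind itself could be done with
offsetsForTimes(), along the lines of this sketch (the topic name and the three-hour window are
arbitrary assumptions):

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.time.Instant;
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of "reprocess a few hours" after a failover: rewind every
// partition of a topic to the offset closest to a chosen timestamp and re-consume.
public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-after-failover");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long replayFrom = Instant.now().minus(Duration.ofHours(3)).toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("events").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, replayFrom));

            // Find the earliest offset at or after the timestamp for each partition and seek there.
            consumer.offsetsForTimes(query).forEach((tp, offsetAndTs) -> {
                if (offsetAndTs != null) {
                    consumer.seek(tp, offsetAndTs.offset());
                }
            });

            // ... normal poll loop from here; downstream processing must be idempotent.
        }
    }
}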
> Ryanne
>
> On Mon, Sep 3, 2018, 9:49 AM Henning Røigaard-Petersen <hrp@edlund.dk>
> wrote:
>
>> I am looking for advice on how to handle disasters not covered by the 
>> official methods of replication, whether intra-cluster replication 
>> (via replication factor and producer acks) or multi-cluster 
>> replication (using Confluent Replicator).
>>
>>
>>
>> We are looking into using Kafka not only as a broker for decoupling, 
>> but also as an event store.
>>
>> For us, data is mission critical and has unlimited retention with the 
>> option to compact (required by GDPR).
>>
>>
>>
>> We are especially interested in two types of disasters:
>>
>> 1. Unintentional or malicious use of the Kafka API to create or compact
>> messages, as well as deletions of topics or logs.
>>
>> 2. Loss of tail data from partitions if all ISRs fail in the primary
>> cluster. A special case of this is the loss of the tail of the commit
>> topic, which results in loss of consumer progress.
>>
>>
>>
>> Ad (1)
>>
>> Replication does not guard against disaster (1), as the resulting 
>> messages and deletions spread to the replicas.
>>
>> A naive solution is to simply secure the cluster, and ensure that 
>> there are no errors in the code (…), but anyone with an ounce of 
>> experience with software knows that stuff goes wrong.
>>
>>
>>
>> Ad (2)
>>
>> The potential of disaster (2) is much worse. In case of total data 
>> center loss, the secondary cluster will lose the tail of every 
>> partition not fully replicated. Besides the data loss itself, there 
>> are now inconsistencies between related topics and partitions, which 
>> break the state of the system.
>>
>> Granted, the likelihood of total center loss is not great, but there 
>> is a reason we have multi-center setups.
>>
>> The special case of losing consumer offsets results in double 
>> processing of events once we resume processing from an older offset. 
>> While idempotency is a solution, it might not always be possible or desirable.
>>
>>
>>
>> Once any of these types of disaster has occurred, there is no way to 
>> restore lost data, and even worse, we cannot restore the cluster to a 
>> point in time where it is consistent. We could probably look our 
>> customers in the eyes and tell them that they lost a day's worth of 
>> progress, but we cannot tell them that the system is in a state of 
>> perpetual inconsistency.
>>
>>
>>
>> The only solution we can think of right now is to shut the primary 
>> cluster down (ensuring that no new events are produced) and then copy 
>> all files to some secure location, i.e. effectively creating a backup 
>> that allows restoration to a consistent point in time. Though the tail 
>> (backup-wise) will be lost in case of disaster, we are at least ensured 
>> a consistent state to restore to.
>>
>> As a useful side effect, such backup-files can also be used to create 
>> environments for test or other destructive purposes.
>>
>>
>>
>> Does anyone have a better idea?
>>
>>
>>
>>
>>
>> Venlig hilsen / Best regards
>>
>> Henning Røigaard-Petersen
>> Principal Developer
>> MSc in Mathematics
>>
>> hrp@edlund.dk
>>
>> Direct phone: +45 36150272
>>
>>
>>
>> Edlund A/S
>> La Cours Vej 7
>> DK-2000 Frederiksberg
>> Tel +45 36150630
>> www.edlund.dk
>>
>