kafka-users mailing list archives

From Andrew Wilcox <andrew.wil...@gmail.com>
Subject Deduplicating a topic in the face of producer crashes over a time window?
Date Fri, 02 Nov 2018 00:33:45 GMT
Suppose I have a producer which is ingesting a data stream with unique keys
from an external service and sending it to a Kafka topic.  In my producer I
can set enable.idempotence and get exactly-once delivery in the presence of
broker crashes.  However, my producer might crash after it delivers a
batch of messages to Kafka but before it records that the batch was
delivered.  After the crashed producer restarts, it would re-deliver the
same batch, resulting in duplicate messages in the topic.
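
For reference, the producer side is just the standard idempotence
settings, something along these lines (a sketch; the bootstrap servers
and serializers are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class IngestProducer {
        static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            // Idempotence covers duplicates from broker-side retries within a
            // producer session, but not a restarted producer re-sending a batch.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            return new KafkaProducer<>(props);
        }
    }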

With a Streams transformer I can deduplicate the topic by using a state
store to record previously seen keys and then only creating an output
record if the key hasn't been seen before.  However, without a mechanism
to remove old keys, the state store will grow without bound.
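
For concreteness, here's roughly the transformer I have in mind (a rough
sketch against the Processor API; the store name "dedup-store" and the
String/Long serdes are just placeholders I picked):

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;

    // Forwards a record only the first time its key is seen.
    public class DedupTransformer
            implements Transformer<String, String, KeyValue<String, String>> {
        private ProcessorContext context;
        private KeyValueStore<String, Long> seen;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.seen = (KeyValueStore<String, Long>)
                    context.getStateStore("dedup-store");
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            if (seen.get(key) != null) {
                return null;                       // duplicate: drop it
            }
            seen.put(key, context.timestamp());    // remember when the key was first seen
            return KeyValue.pair(key, value);      // first occurrence: forward it
        }

        @Override
        public void close() {}
    }

wired up with builder.addStateStore(Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("dedup-store"), Serdes.String(),
Serdes.Long())) and stream.transform(DedupTransformer::new, "dedup-store").
The trouble is that nothing ever deletes anything from "dedup-store".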

Say I only want to deduplicate over a time period such as one day.  (I'm
confident that I'd be able to restart a crashed producer sooner than
that.)  Thus I'd like keys older than a day to expire out of the state
store, so the store only needs to keep track of keys seen in the last day
or so.
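
What I'm imagining is scheduling a wall-clock punctuator in the
transformer's init() that purges entries older than the window, something
like this (the one-day retention and hourly purge interval are arbitrary;
it also needs TimeUnit, PunctuationType, and KeyValueIterator imports):

    // In DedupTransformer.init(), after looking up the store:
    context.schedule(TimeUnit.HOURS.toMillis(1),
                     PunctuationType.WALL_CLOCK_TIME,
                     timestamp -> {
        long cutoff = timestamp - TimeUnit.DAYS.toMillis(1);
        try (KeyValueIterator<String, Long> it = seen.all()) {
            while (it.hasNext()) {
                KeyValue<String, Long> entry = it.next();
                if (entry.value < cutoff) {
                    seen.delete(entry.key);   // expire keys first seen over a day ago
                }
            }
        }
    });

But scanning the whole store on a schedule feels heavy-handed, and I'm
not sure deleting from the store while iterating over it is safe, which
is why I'm wondering whether there's something more idiomatic.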

Is there a way to do this with Kafka Streams?  Or is there another
recommended mechanism to keep a topic of unique-keyed messages free of
duplicates in the presence of producer crashes?

Thanks!

Andrew
