kafka-users mailing list archives

From Patrik Kleindl <pklei...@gmail.com>
Subject Re: Deduplicating a topic in the face of producer crashes over a time window?
Date Fri, 02 Nov 2018 06:37:53 GMT
Hi Andrew

Did you take a look at
We are using this for a case like the one you described.
Growth should be limited with this approach.

Best regards 

> Am 02.11.2018 um 01:33 schrieb Andrew Wilcox <andrew.wilcox@gmail.com>:
> Suppose I have a producer which is ingesting a data stream with unique keys
> from an external service and sending it to a Kafka topic.  In my producer I
> can set enable.idempotence and get exactly-once delivery in the presence of
> broker crashes.  However my producer might crash after it delivers a batch
> of messages to Kafka but before it records that the batch was delivered.
> After restarting the crashed producer it would re-deliver the same batch,
> resulting in duplicate messages in the topic.
> With a streams transformer I can deduplicate the topic by using a state
> store to record previously seen keys and then only creating an output
> record if the key hasn't been seen before.  However without a mechanism to
> remove old keys the state store will grow without bound.
> Say I only want to deduplicate over a time period such as one day.  (I'm
> confident that I'll be able to restart a crashed producer sooner).  Thus
> I'd like keys older than a day to expire out of the state store, so the
> store only needs to keep track of keys seen in the last day or so.
> Is there a way to do this with Kafka streams?  Or is there another
> recommended mechanism to keep messages with unique keys unduplicated in the
> presence of producer crashes?
> Thanks!
> Andrew
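
The transformer-with-expiring-state-store approach Andrew describes can be sketched in plain Python (Kafka Streams itself is Java; this is only an illustration of the idea, not the Streams API). The class below is hypothetical: it keeps a map of key to last-seen timestamp, drops entries older than the window, and only forwards a record the first time its key is seen within the window, so the store's size stays bounded.

```python
import time


class WindowedDeduplicator:
    """Remembers keys seen within `window_secs`; older keys expire,
    so the store only grows with the key arrival rate, not forever.

    Note: seeing a duplicate refreshes the key's timestamp, i.e. this
    is a sliding window per key."""

    def __init__(self, window_secs, clock=time.time):
        self.window_secs = window_secs
        self.clock = clock  # injectable for testing
        self.seen = {}  # key -> last-seen timestamp

    def _expire(self, now):
        # Drop keys whose last sighting is older than the window.
        cutoff = now - self.window_secs
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}

    def accept(self, key):
        """Return True if the record should be forwarded (first sighting
        of `key` within the window), False if it is a duplicate."""
        now = self.clock()
        self._expire(now)
        duplicate = key in self.seen
        self.seen[key] = now
        return not duplicate
```

For example, with a one-day (86400 s) window and a fake clock:

```python
clock = [0.0]
dedup = WindowedDeduplicator(86400, clock=lambda: clock[0])
dedup.accept("order-1")   # True: first sighting, forward it
dedup.accept("order-1")   # False: duplicate within the window
clock[0] += 90000         # more than a day later
dedup.accept("order-1")   # True again: the old entry has expired
```

In actual Kafka Streams, the same shape is typically achieved with a `Transformer` backed by a windowed state store, where retention handles the expiry instead of the manual `_expire` step here.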
