kafka-users mailing list archives

From Bill Bejeck <b...@confluent.io>
Subject Re: deduplication strategy for Kafka Streams DSL
Date Wed, 13 Dec 2017 14:42:53 GMT
Hi Artur,

The most direct way to deduplicate (I'm using the term deduplication to
mean collapsing records with the same key, but not necessarily the same
value, so that later records take precedence) is to set the
CACHE_MAX_BYTES_BUFFERING_CONFIG setting to a value greater than zero.
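
As a rough sketch of what that looks like in configuration: the snippet
below builds the Properties you would pass to KafkaStreams. It uses the
literal key strings behind StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG
and StreamsConfig.COMMIT_INTERVAL_MS_CONFIG so that it compiles without the
Kafka jars. The class name, application id, and broker address are
placeholders I made up, and the commit interval is included only because
the cache is also flushed on commit, so it affects how much gets collapsed.

```java
import java.util.Properties;

// Hypothetical helper; not part of Kafka Streams.
final class DedupCacheConfig {

    static Properties streamsProps(long cacheBytes, long commitIntervalMs) {
        Properties props = new Properties();
        // Placeholder values, replace with your own.
        props.put("application.id", "dedup-example");
        props.put("bootstrap.servers", "localhost:9092");
        // StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG:
        // a value > 0 lets the cache collapse updates per key; 0 disables it.
        props.put("cache.max.bytes.buffering", cacheBytes);
        // StreamsConfig.COMMIT_INTERVAL_MS_CONFIG:
        // the cache is flushed downstream on each commit.
        props.put("commit.interval.ms", commitIntervalMs);
        return props;
    }
}
```

For example, DedupCacheConfig.streamsProps(10L * 1024 * 1024, 30_000L)
asks for a 10 MB cache and a 30 second commit interval. Note that the cache
reduces the number of updates emitted per key; it does not guarantee that
exactly one record per key is emitted.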

Your other option is to use the Processor API (PAPI): write your own
logic, in conjunction with a state store, to determine what constitutes a
duplicate and when to emit a record.  You could take the same approach in
the DSL layer using a Transformer.
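
To make the idea concrete, here is a minimal plain-Java sketch of the
decision logic that could sit inside a Transformer's transform() method.
It is not real Kafka Streams code: a HashMap stands in for the state store
(in a real topology you would use a KeyValueStore obtained from the
ProcessorContext), and the class name is made up. It also assumes one
particular definition of "duplicate", namely a record whose value equals
the last value emitted for the same key; your own definition may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical class illustrating the logic only.
final class LatestValueDeduplicator<K, V> {

    // Stand-in for a persistent key-value state store.
    private final Map<K, V> lastEmitted = new HashMap<>();

    /**
     * Returns the value to emit, or Optional.empty() when the record is
     * a duplicate and should be dropped. Null values are rejected here
     * to keep the sketch short.
     */
    Optional<V> process(K key, V value) {
        Objects.requireNonNull(value, "value");
        V previous = lastEmitted.get(key);
        if (value.equals(previous)) {
            return Optional.empty();   // same key, same value: drop
        }
        lastEmitted.put(key, value);   // remember what was emitted
        return Optional.of(value);     // new or changed value: emit
    }
}
```

In a real Transformer, the "drop" branch would correspond to returning
null from transform() so that nothing is forwarded, and the "emit" branch
to returning a KeyValue for the record.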



On Wed, Dec 13, 2017 at 7:00 AM, Artur Mrozowski <artmro@gmail.com> wrote:

> Hi
> I run an app where I transform KTable to stream and then I groupBy and
> aggregate and capture the results in KTable again. That generates many
> duplicates.
> I have played with exactly once semantics that seems to reduce duplicates
> for records that should be unique. But I still get duplicates on keys that
> have two or more records.
> I could not reproduce it on a small number of records, so I disabled
> caching by setting CACHE_MAX_BYTES_BUFFERING_CONFIG to 0. Sure enough, I
> got loads of duplicates, even those previously eliminated by exactly once
> semantics. Now I have a hard time enabling it again on Confluent 3.3.
> But, generally, what is the best deduplication strategy for Kafka Streams?
> Artur
