kafka-users mailing list archives

From Ofir Sharony <ofir.shar...@myheritage.com>
Subject Deduplicating KStream-KStream join
Date Thu, 04 May 2017 15:26:31 GMT
Hi guys,

I want to perform a join between two KStreams.
An event may appear only on one of the streams (either one of them), so I
can't use inner join (which emits only on a match) or left join (which
emits only when the left input arrives).
This leaves me with outer join. The problem with outer join is that it
emits on every record arrival, which creates duplicates at the output node.
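
To make the duplication concrete, here is a simplified model in plain Java. It is not Kafka Streams code: the class name OuterJoinSketch and its methods are made up for illustration, join windows are ignored, and it only assumes that the outer join emits eagerly on each arrival, with null standing in for the side that has not arrived yet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of an eagerly emitting outer join.
// Windows and retention are deliberately left out.
public class OuterJoinSketch {
    private final Map<String, String> left = new HashMap<>();
    private final Map<String, String> right = new HashMap<>();
    final List<String> output = new ArrayList<>();

    // A left record arrives: emit it joined with whatever the right side holds.
    void onLeft(String key, String value) {
        left.put(key, value);
        output.add(key + "=" + value + "|" + right.get(key));
    }

    // A right record arrives: emit it joined with whatever the left side holds.
    void onRight(String key, String value) {
        right.put(key, value);
        output.add(key + "=" + left.get(key) + "|" + value);
    }

    public static void main(String[] args) {
        OuterJoinSketch join = new OuterJoinSketch();
        join.onLeft("k1", "a");   // emits the partial result for k1
        join.onRight("k1", "b");  // emits the complete result for k1
        System.out.println(join.output);
    }
}
```

In this model a single key yields two output records, a partial one and then the complete one, which is the duplication I see downstream.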

My downstream application is BigQuery, which doesn't support updates, so it
can't do the dedup by itself.
What is the best practice for implementing deduplication in Kafka Streams,
keeping only the latest, most up-to-date record?
Is it possible to emit a record only after some time has passed, or upon a
certain trigger?
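
As a rough sketch of the behavior I have in mind, again in plain Java rather than Kafka Streams code: buffer the latest value per key and release it only once the key has been held for some period. The names LatestRecordBuffer, put, flush and holdMs are hypothetical, and the caller supplies the clock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical buffer: keeps only the latest value per key and releases a
// key once it has been held for at least holdMs milliseconds.
public class LatestRecordBuffer {
    private final long holdMs;
    private final Map<String, String> latest = new LinkedHashMap<>();
    private final Map<String, Long> firstSeen = new LinkedHashMap<>();

    LatestRecordBuffer(long holdMs) {
        this.holdMs = holdMs;
    }

    // Overwrite the buffered value; remember when the key was first seen.
    void put(String key, String value, long nowMs) {
        latest.put(key, value);
        firstSeen.putIfAbsent(key, nowMs);
    }

    // Called on a timer or other trigger: emit and drop every key that has
    // been buffered for at least holdMs.
    List<String> flush(long nowMs) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = firstSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() >= holdMs) {
                out.add(e.getKey() + "=" + latest.remove(e.getKey()));
                it.remove();
            }
        }
        return out;
    }
}
```

I assume something like this would have to sit after the join, perhaps in a custom processor backed by a state store and driven by a periodic callback, but I don't know whether that is the recommended approach.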


*Ofir Sharony*
BackEnd Tech Lead

Mobile: +972-54-7560277 | ofir.sharony@myheritage.com
| www.myheritage.com
MyHeritage Ltd., 3 Ariel Sharon St., Or Yehuda 60250, Israel


<https://twitter.com/myheritage>         <http://blog.myheritage.com/>
