spark-user mailing list archives

From Jörn Franke <>
Subject Re: ideas on de duplication for spark streaming?
Date Sat, 24 Sep 2016 16:50:30 GMT
As Cody said, Spark is not going to help you here. 
There are two issues to look at: duplicate messages being processed by two different
processes, and failure of any component (including the message broker). Keep in mind that
duplicates can even show up weeks later (e.g., from experience: the message broker is
restarted and a message is sent again weeks later).
As said, a DHT can help, but implementing one yourself is a lot of error-prone effort.
You may want to look at dedicated Redis nodes instead. Redis supports partitioning, is very
fast (but please create only one connection per node, not one per lookup), and provides many
different data structures to solve your problem (e.g. atomic counters, sets).
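One common way to use Redis here is to key each message by a unique id and rely on an atomic set-if-absent with an expiry (Redis's SET with the NX and EX options) to decide whether the message has already been seen. Below is a minimal sketch of that idea; to keep it self-contained it uses an in-memory stand-in with the same semantics instead of a live Redis connection, and the class and function names (`InMemoryStore`, `is_duplicate`) are illustrative, not part of any real API.

```python
import time

class InMemoryStore:
    """In-memory stand-in mimicking Redis `SET key value NX EX ttl`.
    In production you would instead call redis-py, e.g.:
        client.set("dedup:" + msg_id, "1", nx=True, ex=ttl_seconds)
    which returns True only if the key was newly set."""
    def __init__(self):
        self._expiry = {}  # key -> expiry timestamp

    def set_nx_ex(self, key, ttl_seconds):
        now = time.time()
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now:
            return False           # key already present and not expired
        self._expiry[key] = now + ttl_seconds
        return True                # key was newly set

def is_duplicate(store, msg_id, ttl_seconds=7 * 24 * 3600):
    # The TTL bounds memory use but also bounds the de-duplication window;
    # since duplicates can arrive weeks later, choose it accordingly.
    return not store.set_nx_ex("dedup:" + msg_id, ttl_seconds)

store = InMemoryStore()
print(is_duplicate(store, "msg-42"))  # False: first time seen
print(is_duplicate(store, "msg-42"))  # True: duplicate
```

Inside Spark Streaming this check would typically run in mapPartitions/foreachPartition so that each partition opens one connection and reuses it for all lookups, rather than connecting per message.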

> On 24 Sep 2016, at 08:49, kant kodali <> wrote:
> Hi Guys,
> I have a bunch of data coming in to my Spark Streaming cluster from a message queue (not
> Kafka). This message queue only guarantees at-least-once delivery, so some of the messages
> that reach the Spark Streaming cluster may be duplicates, and I am trying to figure out the
> best way to filter them. I was thinking of using a hashmap as a broadcast variable, but then
> I saw that broadcast variables are read-only. Also, instead of having a global hashmap
> variable across every worker node, I am thinking a distributed hash table would be a better
> idea. Any suggestions on how best to approach this problem by leveraging existing
> functionality?
> Thanks,
> kant
