spark-user mailing list archives

From Shushant Arora <>
Subject Re: spark as a lookup engine for dedup
Date Mon, 27 Jul 2015 08:21:23 GMT
This is for one day of events, on the order of 1 billion, and processing
happens in a streaming application with an interval of ~10-15 seconds, so
lookups need to be fast. The RDD needs to be updated with new events, and
events older than 24 hours before the current time should be removed on
each processing run.

So is a Spark RDD not a good fit for this requirement?
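
To make the requirement concrete, here is a minimal sketch of the lookup behaviour described above: remember each event id for 24 hours, reject ids seen within that window, and drop expired ids on every call. It is plain Java with no Spark involved. The class name `EventDedupWindow` and its methods are hypothetical, and it assumes timestamps arrive in non-decreasing order. It stands in for whatever key-value store would actually hold the ids; a billion ids would likely not fit comfortably in a single JVM heap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Spark API): remembers event ids for a fixed time window.
class EventDedupWindow {
    private final long windowMillis;
    // eventId -> timestamp at which the id was first seen
    private final Map<String, Long> firstSeenAt = new HashMap<>();
    // ids in arrival order, so expired ids can be dropped from the front
    private final Deque<String> arrivalOrder = new ArrayDeque<>();

    EventDedupWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the id is new (and records it), false if it is a
    // duplicate that was already seen inside the window.
    boolean markIfNew(String eventId, long nowMillis) {
        evictUpTo(nowMillis - windowMillis);
        if (firstSeenAt.containsKey(eventId)) {
            return false;
        }
        firstSeenAt.put(eventId, nowMillis);
        arrivalOrder.addLast(eventId);
        return true;
    }

    // Drops ids first seen at or before the cutoff.
    // Assumption: timestamps are non-decreasing, so the oldest id is at the front.
    private void evictUpTo(long cutoffMillis) {
        while (!arrivalOrder.isEmpty()) {
            String oldest = arrivalOrder.peekFirst();
            if (firstSeenAt.get(oldest) > cutoffMillis) {
                break;
            }
            arrivalOrder.removeFirst();
            firstSeenAt.remove(oldest);
        }
    }

    int size() {
        return firstSeenAt.size();
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        EventDedupWindow window = new EventDedupWindow(day);
        System.out.println(window.markIfNew("evt-1", 0L));       // first sighting
        System.out.println(window.markIfNew("evt-1", 10_000L));  // repeat inside the window
        System.out.println(window.markIfNew("evt-1", day + 1L)); // repeat after the window has passed
    }
}
```

Both the lookup and the insert are constant-time hash operations, and expiry only touches the ids that have actually aged out, which is the access pattern the 10-15 second interval calls for.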

On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman <> wrote:

> What is the throughput of processing, and for how long do you need to
> remember duplicates?
> You can take all the events, put them in an RDD, group by the key, and
> then process each key only once.
> But if you have a long-running application where you want to check that
> you didn't see the same value before, and check that for every value, you
> probably need a key-value store, not an RDD.
> On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <>
> wrote:
>> Hi,
>> I have a requirement to process a large volume of events while ignoring
>> duplicates.
>> Events are consumed from Kafka and each event has an eventid. An event
>> that has already been processed may arrive again at some other
>> offset.
>> 1. Can I use a Spark RDD to persist processed events and then look up
>> new events against it while processing them? (How do I do a lookup
>> inside an RDD? I have a JavaPairRDD<eventid,timestamp>.)
>> If an event is present in the persisted RDD, ignore it; otherwise
>> process the event. Will rdd.lookup(key) on billions of
>> events be efficient?
>> 2. How do I update the RDD, given that an RDD is immutable?
>> Thanks
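
For the batch-style suggestion quoted above (group by key, then process each key once), a rough sketch in plain Java follows. It is not Spark code: the class `BatchDedup` and the method `dedupe` are hypothetical names, and the events are modelled as (eventid, timestamp) pairs to mirror the `JavaPairRDD<eventid,timestamp>` in the original question. On a pair RDD the same idea would roughly correspond to a `reduceByKey` that keeps one value per key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Spark API): collapses one batch to a single entry per event id.
class BatchDedup {
    // Keeps the earliest timestamp for each event id; ids stay in first-arrival order.
    static Map<String, Long> dedupe(List<Map.Entry<String, Long>> events) {
        Map<String, Long> earliest = new LinkedHashMap<>();
        for (Map.Entry<String, Long> event : events) {
            // merge() inserts the value if the id is absent, otherwise keeps the smaller timestamp
            earliest.merge(event.getKey(), event.getValue(), Long::min);
        }
        return earliest;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> batch = new ArrayList<>();
        batch.add(new AbstractMap.SimpleImmutableEntry<>("evt-1", 100L));
        batch.add(new AbstractMap.SimpleImmutableEntry<>("evt-2", 105L));
        batch.add(new AbstractMap.SimpleImmutableEntry<>("evt-1", 230L)); // same id at a later offset
        // Each id appears once in the result, so each is processed once.
        dedupe(batch).forEach((id, ts) -> System.out.println(id + " first seen at " + ts));
    }
}
```

This only removes duplicates inside a single batch. Catching an id that was already processed in an earlier batch still needs state that outlives the batch, which is the point the reply makes about a key-value store.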
