spark-user mailing list archives

From Romi Kuntsman <r...@totango.com>
Subject Re: spark as a lookup engine for dedup
Date Mon, 27 Jul 2015 07:38:39 GMT
What is the throughput of processing, and for how long do you need to
remember duplicates?

You can take all the events, put them in an RDD, group by the key, and then
process each key only once.
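Loosely sketched, the batch approach is a reduce-by-key that keeps one record per eventid. The snippet below illustrates the idea with plain Java collections rather than a real `JavaPairRDD` (so it runs without a cluster); the merge function plays the role you would pass to `reduceByKey`, here keeping the earliest timestamp per eventid:

```java
import java.util.*;
import java.util.stream.*;

public class DedupBatch {
    // Collapse duplicate eventids, keeping the earliest timestamp per key --
    // the same effect as reduceByKey(Math::min) on a
    // JavaPairRDD<eventid, timestamp>.
    static Map<String, Long> dedup(List<Map.Entry<String, Long>> events) {
        return events.stream().collect(Collectors.toMap(
                Map.Entry::getKey,    // eventid
                Map.Entry::getValue,  // timestamp
                Math::min));          // merge duplicates: keep earliest
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> events = List.of(
                Map.entry("e1", 100L),
                Map.entry("e2", 200L),
                Map.entry("e1", 50L)); // duplicate eventid
        System.out.println(dedup(events)); // each eventid appears once
    }
}
```

Each downstream step then sees every eventid at most once, so processing stays idempotent within the batch.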
But if you have a long-running application where, for every incoming value,
you want to check that you didn't see the same value before, you probably
need a key-value store, not an RDD.
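The check-before-process pattern against a key-value store can be sketched like this. A `HashMap` stands in for the external store (in practice something like Redis or HBase, possibly with a TTL matching how long you need to remember duplicates); the class name and method are illustrative, not from any Spark API:

```java
import java.util.*;

public class DedupStream {
    // Stand-in for an external key-value store mapping
    // eventid -> first-seen timestamp.
    private final Map<String, Long> seen = new HashMap<>();

    // Returns true if the event is new and should be processed,
    // false if this eventid was already recorded.
    public boolean shouldProcess(String eventId, long timestamp) {
        // putIfAbsent returns null only when the key was not present,
        // so the record-and-check is a single operation.
        return seen.putIfAbsent(eventId, timestamp) == null;
    }
}
```

With a real store, the same lookup works across restarts and across Kafka offset replays, which an in-memory RDD cannot give you.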

On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <shushantarora09@gmail.com>
wrote:

> Hi
>
> I have a requirement for processing a large volume of events while
> ignoring duplicates.
>
> Events are consumed from Kafka and each event has an eventid. It may happen
> that an event is already processed and arrives again at some other offset.
>
> 1. Can I use a Spark RDD to persist processed events and then look up
> against this RDD (how do I do a lookup inside an RDD? I have a
> JavaPairRDD<eventid,timestamp>)
> while processing new events, so that if an event is present in the
> persisted RDD I ignore it, else I process the event. Will rdd.lookup(key)
> on billions of events be efficient?
>
> 2. How do I update the RDD (since an RDD is immutable)?
>
> Thanks
>
>
