spark-user mailing list archives

From Romi Kuntsman <>
Subject Re: spark as a lookup engine for dedup
Date Mon, 27 Jul 2015 08:26:30 GMT
An RDD is immutable: it cannot be changed, and you can only create a new one
from data or from a transformation. Creating a new one every 15 seconds to
cover the last 24 hours sounds inefficient.
I think a key-value store would be a much better fit for this purpose.
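
A minimal sketch of the key-value approach, in plain Python with no Spark and no real store. It assumes each event id only needs to be remembered for 24 hours. The names SeenEvents, mark_if_new, and process_batch are made up for illustration, and the in-memory OrderedDict only stands in for whatever external store would actually be used.

```python
import time
from collections import OrderedDict


class SeenEvents:
    """In-memory stand-in for a key-value store whose entries expire."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock                # injectable so expiry can be tested
        self._first_seen = OrderedDict()  # event_id -> first-seen time

    def _evict_expired(self, now):
        # A live id is never re-inserted, so with a non-decreasing clock
        # the oldest entry is always at the front of the OrderedDict.
        while self._first_seen:
            _, seen_at = next(iter(self._first_seen.items()))
            if now - seen_at < self.ttl_seconds:
                break
            self._first_seen.popitem(last=False)

    def mark_if_new(self, event_id):
        """Record event_id and return True if it was not seen within the TTL."""
        now = self.clock()
        self._evict_expired(now)
        if event_id in self._first_seen:
            return False
        self._first_seen[event_id] = now
        return True


def process_batch(event_ids, seen):
    """Return the events of one micro-batch that still need processing."""
    return [e for e in event_ids if seen.mark_if_new(e)]
```

With around a billion ids per day this would have to live in an external store with per-key expiry rather than in one process's memory. The sketch only shows the check-then-record shape of the lookup that each 10-15 second batch would perform.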

On Mon, Jul 27, 2015 at 11:21 AM Shushant Arora <> wrote:

> It is for one day of events, on the order of 1 billion, and processing runs
> in a streaming application at ~10-15 sec intervals, so lookups must be fast.
> The RDD needs to be updated with new events, and events older than 24 hours
> must be removed on each processing run.
> So is a Spark RDD not a fit for this requirement?
> On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman <> wrote:
>> What is the throughput of processing, and for how long do you need to
>> remember duplicates?
>> You can take all the events, put them in an RDD, group by the key, and
>> then process each key only once.
>> But if you have a long-running application where you want to check, for
>> every value, that you haven't seen the same value before, you probably
>> need a key-value store, not an RDD.
>> On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <>
>> wrote:
>>> Hi
>>> I have a requirement to process a large volume of events while ignoring
>>> duplicates.
>>> Events are consumed from Kafka, and each event has an eventid. An event
>>> that has already been processed may arrive again at some other offset.
>>> 1. Can I use a Spark RDD to persist processed events and then look up
>>> against this RDD while processing new events? (How do I do a lookup
>>> inside an RDD? I have a JavaPairRDD<eventid,timestamp>.) If an event is
>>> present in the persisted RDD, ignore it; otherwise process the event.
>>> Will rdd.lookup(key) be efficient on billions of events?
>>> 2. How do I update the RDD, given that an RDD is immutable?
>>> Thanks
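
The batch suggestion made earlier in this thread (put the events in an RDD, group by key, process each key once) can be sketched in plain Python, without Spark. The function name dedup_batch and the choice to keep the earliest timestamp per event id are assumptions for illustration; on a JavaPairRDD the analogous step would be a groupByKey or reduceByKey over the event id.

```python
from collections import defaultdict


def dedup_batch(events):
    """Group (event_id, timestamp) pairs by id and keep the earliest timestamp."""
    by_id = defaultdict(list)
    for event_id, timestamp in events:
        by_id[event_id].append(timestamp)
    # One entry per event id, so each id is processed only once.
    return {event_id: min(stamps) for event_id, stamps in by_id.items()}
```

This only removes duplicates inside a single batch. Remembering ids across batches for 24 hours is the part that calls for a key-value store.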
