If you are looking to eliminate duplicate rows (or nearly duplicate ones), you can derive a key from the data and then do a reduceByKey on that key.
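
For example, something along these lines (just a sketch; dedupKey is a placeholder for whatever function derives the key from your records, and the input path is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("dedup").setMaster("local[*]"))  // example config

    // Placeholder: derive the dedup key from a record however your criteria demand.
    def dedupKey(line: String): String = line.trim.toLowerCase

    val records = sc.textFile("hdfs:///path/to/input")  // example path
    val deduped = records
      .map(r => (dedupKey(r), r))   // key = dedup criteria, value = original record
      .reduceByKey((a, _) => a)     // keep one value per key; returning the first
                                    // argument as-is is fine, no copy needed
      .values

Note that the "first" argument to the reduce function is not necessarily the first record in input order; which value survives is arbitrary unless the reducer itself is deterministic.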

Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
What is your data like? Are you looking at exact matching or are you interested in nearly same records? Do you need to merge similar records to get a canonical value?

Best Regards,
Sonal
Nube Technologies 

On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Maybe you could implement something like this (I don't know whether something similar already exists in Spark):

http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio

On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
Multiple values may differ yet still be considered duplicates, depending on how the dedup criteria are chosen. Is that correct? In that case, do you care which value you select for a given key?
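
For instance (just a sketch, assuming an existing SparkContext sc; the sample data is made up):

    // When distinct values share a dedup key, the reduce function decides
    // which one survives. (a, b) => a keeps an arbitrary value, since
    // combining order is not guaranteed; a deterministic reducer gives a
    // stable, reproducible pick.
    val pairs = sc.parallelize(Seq(
      ("yao ge", "Yao Ge"),
      ("yao ge", "YAO GE"),
      ("n. chammas", "Nicholas Chammas")))

    val arbitrary = pairs.reduceByKey((a, _) => a)                    // nondeterministic pick
    val canonical = pairs.reduceByKey((a, b) => if (a <= b) a else b) // lexicographic min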

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <yge@ford.com> wrote:

I need to do deduplication processing in Spark. The current plan is to generate a tuple where the key is the dedup criteria and the value is the original input. I am thinking of using reduceByKey to discard duplicate values. If I do that, can I simply return the first argument, or should I return a copy of it? Is there a better way to do dedup in Spark?

-Yao