If you want to eliminate duplicate rows (or something similar), you can derive a key from the data and then call reduceByKey on that key.
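As a rough sketch of that idea, the snippet below emulates the keyed reduce in plain Python so it runs without a cluster. The sample rows, `dedup_key`, and the `reduce_by_key` helper are all made up for illustration; the helper only stands in for what `reduceByKey` does per key.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for a keyed reduce: fold all values sharing a key with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

# Hypothetical input rows: (name, email, city).
rows = [
    ("Ann", "ann@example.com", "Oslo"),
    ("Bob", "bob@example.com", "Rome"),
    ("Ann", "ann@example.com", "Oslo"),
]

def dedup_key(row):
    # Assumption: two rows are duplicates when name and email match.
    return (row[0], row[1])

# Pair every row with its dedup key, then keep one row per key.
keyed = [(dedup_key(row), row) for row in rows]
deduped = list(reduce_by_key(keyed, lambda first, _second: first).values())

# In PySpark the same shape would be roughly:
#   rdd.map(lambda row: (dedup_key(row), row)).reduceByKey(lambda a, b: a).values()
print(deduped)
```

Keeping the first argument is only safe here because rows with the same key are treated as interchangeable; which one survives is arbitrary.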

What is your data like? Are you looking for exact matches only, or are you also interested in nearly identical records? Do you need to merge similar records to get a canonical value?
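If the answer is "nearly identical records, merged into one", one possible shape is to normalize the key and let the reduce function combine fields. This is only a sketch in plain Python: the record layout, `normalize`, `merge`, and `reduce_by_key` are assumptions for illustration, not anything from Spark itself.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for a keyed reduce: fold all values sharing a key with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

def normalize(email):
    # Assumption: emails differing only in case or surrounding whitespace match.
    return email.strip().lower()

def merge(a, b):
    # Build a canonical record: for each field keep the first non-empty value.
    # If two records disagree on a non-empty field, the result depends on
    # argument order, so a real job would need an explicit tie-break rule.
    return {field: a.get(field) or b.get(field) for field in a.keys() | b.keys()}

# Hypothetical records with partially filled fields.
records = [
    {"email": "Ann@Example.com ", "name": "Ann", "phone": ""},
    {"email": "ann@example.com", "name": "", "phone": "555-0100"},
    {"email": "bob@example.com", "name": "Bob", "phone": ""},
]

# Key each record by its normalized email and store the normalized form too.
keyed = [
    (normalize(r["email"]), {**r, "email": normalize(r["email"])})
    for r in records
]
canonical = reduce_by_key(keyed, merge)
print(canonical["ann@example.com"])
```

With exact matching only, the merge step is unnecessary and any one record per key will do.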

Maybe you could implement something like this (I don't know if something similar already exists in Spark):



Multiple values may be different, yet still be considered duplicates, depending on how the dedup criteria are chosen. Is that correct? In that case, do you care which value is selected for a given key?
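If you do care which value wins, the reduce function can pick it explicitly instead of keeping whichever argument arrives first. Spark expects the function passed to reduceByKey to be associative and commutative, so the choice should not depend on argument order. Below is a small plain-Python sketch; the `updated_at` field and the `newest` and `pick_per_key` helpers are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (key, value) pairs: values differ but share a dedup key.
events = [
    ("user-1", {"city": "Oslo", "updated_at": 10}),
    ("user-1", {"city": "Bergen", "updated_at": 20}),
    ("user-2", {"city": "Rome", "updated_at": 5}),
]

def newest(a, b):
    # Keep the record with the larger timestamp.
    # Assumption: timestamps are distinct within a key, so order cannot matter.
    return a if a["updated_at"] >= b["updated_at"] else b

def pick_per_key(pairs, func):
    """Group values by key locally, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

chosen = pick_per_key(events, newest)

# Feeding the pairs in the opposite order gives the same winners.
chosen_reversed = pick_per_key(list(reversed(events)), newest)
print(chosen == chosen_reversed)
```

If any value is acceptable for a key, a simpler function that returns its first argument would do.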

I need to do deduplication processing in Spark. The current plan is to generate tuples where the key is the dedup criteria and the value is the original input. I am thinking of using reduceByKey to discard duplicate values. If I do that, can I simply return the first argument, or should I return a copy of the first argument? Is there a better way to do dedup in Spark?