What is your data like? Are you looking for exact matches, or are you interested in nearly-identical records? Do you need to merge similar records to get a canonical value?

Best Regards,
Nube Technologies

On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <firstname.lastname@example.org> wrote:
Maybe you could implement something like this (I don't know if something similar already exists in Spark):
Flavio

On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <email@example.com> wrote:

Multiple values may be different, yet still be considered duplicates, depending on how the dedup criteria are selected. Is that correct? Do you care in that case which value you select for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <firstname.lastname@example.org> wrote:
I need to do deduplication processing in Spark. The current plan is to generate a tuple where the key is the dedup criterion and the value is the original input. I am thinking of using reduceByKey to discard duplicate values. If I do that, can I simply return the first argument, or should I return a copy of the first argument? Is there a better way to do dedup in Spark?
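A minimal sketch of the reduceByKey plan described above. The record shape and the `dedup_key` function (a normalized `email` field) are hypothetical stand-ins for whatever the real dedup criterion is; the runnable part simulates reduceByKey's keep-one-value-per-key semantics in plain Python, with the corresponding PySpark calls shown in a comment.

```python
# Hypothetical example: records are dicts, and the dedup criterion is a
# normalized "email" field (an assumption -- substitute your own key).
#
# In Spark itself this would be roughly:
#   rdd.map(lambda r: (dedup_key(r), r)) \
#      .reduceByKey(lambda a, b: a) \
#      .values()
#
# Returning the first argument unmodified from the reduce function is
# fine here: the function just picks one of the two values, it does not
# mutate them, so no defensive copy should be needed.

def dedup_key(record):
    """Hypothetical dedup criterion: the email field, normalized."""
    return record["email"].strip().lower()

def reduce_by_key_first(pairs):
    """Plain-Python stand-in for reduceByKey(lambda a, b: a):
    keeps the first value seen for each key."""
    out = {}
    for key, value in pairs:
        out.setdefault(key, value)  # later duplicates are discarded
    return list(out.values())

records = [
    {"email": "Alice@Example.com", "name": "Alice"},
    {"email": "alice@example.com ", "name": "Alice B."},  # same key as above
    {"email": "bob@example.com", "name": "Bob"},
]

deduped = reduce_by_key_first((dedup_key(r), r) for r in records)
# deduped keeps one record per key: the first "alice@example.com" and "bob@example.com"
```

Note that with `lambda a, b: a` the surviving value depends on partitioning and is effectively arbitrary, which circles back to Nicholas's question: this only matters if you care which of the duplicates you keep.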