spark-user mailing list archives

From Nicholas Chammas <>
Subject Re: Dedup
Date Wed, 08 Oct 2014 19:57:18 GMT
Multiple values may differ yet still count as duplicates, depending on
how the dedup criteria are chosen. Is that correct? If so, do you care
which value you keep for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <> wrote:

>  I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where the key is the dedup criteria and the value is the
> original input. I am thinking of using reduceByKey to discard duplicate
> values. If I do that, can I simply return the first argument, or should I
> return a copy of the first argument? Is there a better way to do dedup in
> Spark?
> -Yao
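
The plan described in the quoted message can be sketched as follows. This is
a minimal, Spark-free model of the idea, not the poster's actual code: the
helper reduce_by_key is a hypothetical local stand-in for the semantics of
RDD.reduceByKey, and the record layout and the dedup_key criterion
(case-insensitive email) are assumptions made up for illustration.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey.

    Merges all values that share a key by repeatedly applying func.
    """
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged


def dedup_key(record):
    # Assumed dedup criterion: email address, ignoring case and whitespace.
    return record["email"].strip().lower()


# Made-up sample input; records 1 and 3 differ but share a dedup key.
records = [
    {"id": 1, "email": "Ann@Example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com "},
]

# Build (key, original input) tuples, then keep one value per key.
keyed = [(dedup_key(r), r) for r in records]
deduped = reduce_by_key(keyed, lambda first, second: first)

for key, record in sorted(deduped.items()):
    print(key, record["id"])
```

With a real RDD the corresponding call would be
rdd.reduceByKey(lambda a, b: a). The reducer here returns its first argument
unchanged and mutates nothing, so as far as I know no copy is needed. One
caveat that bears on the question in the reply: Spark expects the reduce
function to be associative and commutative, and it does not guarantee the
order in which values for a key are merged, so "first" should be read as "an
arbitrary representative" unless the function picks one deterministically
(for example, the record with the smallest id).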
