spark-user mailing list archives

From Sean Owen
Subject Re: Dedup
Date Thu, 09 Oct 2014 07:03:48 GMT
I think the question is about copying the argument. If it's an
immutable value like a String, then yes, just return the first argument
and ignore the second. If you're dealing with a notoriously mutable
value like a Hadoop Writable, you need to copy the value you return.
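
As a rough illustration of the difference, here is a minimal sketch in
plain Python. The function names are hypothetical, and a dict with a
nested list merely stands in for a mutable value such as a Hadoop
Writable; nothing here is Spark API.

```python
import copy


def keep_first(a, b):
    # Fine for immutable values such as strings: return the first, ignore the second.
    return a


def keep_first_copy(a, b):
    # For mutable values, return a copy so later changes to `a` cannot leak
    # into the value that was kept.
    return copy.deepcopy(a)


# Immutable case: returning the argument itself is safe.
kept_string = keep_first("record-1", "record-1-duplicate")

# Mutable case: the dict is a stand-in for a mutable container (an assumption
# made for this sketch).
first = {"id": 1, "tags": ["x"]}
second = {"id": 1, "tags": ["y"]}

aliased = keep_first(first, second)      # same object as `first`
copied = keep_first_copy(first, second)  # independent copy

# Simulate the container being modified after the reduce step.
first["tags"].append("mutated later")
```

After the final line, `aliased` reflects the later mutation because it is
the same object as `first`, while `copied` still holds the original
contents.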

This works fine, although you will spend a fair bit of time marshaling
all of those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but
would be faster, to do something like this: use mapPartitions to retain
one input value for each unique dedup criterion within a partition,
output those pairs, and then reduceByKey the result.
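
A minimal sketch of that idea, written in plain Python so it runs
without a cluster. The names dedup_partition, keep_first and
simulate_dedup are hypothetical, and simulate_dedup is only a local
stand-in for what Spark would do across partitions, not Spark itself.

```python
def dedup_partition(pairs):
    """Keep the first value seen for each key within a single partition."""
    seen = {}
    for key, value in pairs:
        if key not in seen:
            seen[key] = value
    # mapPartitions expects an iterable of output records.
    return iter(seen.items())


def keep_first(a, b):
    # Reducer for the final pass: keep one value per key, discard the other.
    return a


def simulate_dedup(partitions):
    # Local stand-in (an assumption for this sketch): dedup each partition,
    # then merge the survivors by key the way a reduce-by-key step would.
    merged = {}
    for partition in partitions:
        for key, value in dedup_partition(iter(partition)):
            if key in merged:
                merged[key] = keep_first(merged[key], value)
            else:
                merged[key] = value
    return merged


# Two hypothetical partitions of (dedup key, original input) pairs.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("c", 5)],
]
result = simulate_dedup(partitions)
```

With PySpark, the same two functions could be passed along the lines of
pairs.mapPartitions(dedup_partition).reduceByKey(keep_first), where pairs
is an RDD of (key, value) tuples. Note that dedup_partition holds one
value per unique key of a partition in memory, so this assumes the set
of unique keys per partition fits comfortably.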

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) wrote:
> I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where the key is the dedup criteria and the value is the
> original input. I am thinking of using reduceByKey to discard duplicate
> values. If I do that, can I simply return the first argument, or should I
> return a copy of the first argument? Is there a better way to do dedup in
> Spark?
> -Yao
