spark-user mailing list archives

From "Ge, Yao (Y.)" <>
Subject RE: Dedup
Date Thu, 09 Oct 2014 11:32:01 GMT
Yes. I was using a String array as the argument type in reduceByKey. The elements of a String
array are immutable, so as long as the array itself is not modified, I think simply returning
the first argument without cloning it should work. I will look into mapPartitions, as we can
have up to 40% duplicates. I will follow up on this if necessary.
Thanks very much Sean!


-----Original Message-----
From: Sean Owen [] 
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Subject: Re: Dedup

I think the question is about copying the argument. If it's an immutable value like String,
yes just return the first argument and ignore the second. If you're dealing with a notoriously
mutable value like a Hadoop Writable, you need to copy the value you return.
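
To illustrate why the copy matters for mutable values, here is a minimal Python sketch. The `reusing_reader` function is a hypothetical stand-in for a reader that hands back the same mutable object for every record, as Hadoop record readers do with Writables; it is not a Spark or Hadoop API.

```python
import copy

def reusing_reader(rows):
    """Yield every row through one shared mutable buffer (hypothetical reader)."""
    buf = []
    for row in rows:
        buf.clear()
        buf.extend(row)
        yield buf

rows = [["k1", "a"], ["k2", "b"]]

# Holding on to the yielded object directly: every entry aliases the same buffer,
# so all entries end up showing the last record that was read.
aliased = [rec for rec in reusing_reader(rows)]

# Copying before holding on to it: each entry keeps its own contents.
copied = [copy.copy(rec) for rec in reusing_reader(rows)]

print(aliased)
print(copied)
```

With an immutable value such as a String there is no shared buffer to overwrite, which is why returning the first argument as-is is safe in that case.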

This works fine although you will spend a fair bit of time marshaling all of those duplicates
together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but would be faster, to do
something like this: use mapPartitions to retain one input value for each unique dedup key
within a partition, output those pairs, and then reduceByKey the result.
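
The two-step approach above could be sketched roughly as follows. This is plain Python that simulates partitions with lists, not code from this thread; the names `dedup_partition` and `keep_first` and the sample data are assumptions made for illustration.

```python
from itertools import chain

def dedup_partition(records):
    """Keep the first (key, value) pair seen for each key within one partition."""
    seen = {}
    for key, value in records:
        if key not in seen:
            seen[key] = value
    return iter(seen.items())

def keep_first(a, b):
    """Reducer for duplicates that span partitions: keep the first, drop the second."""
    return a

# Two hypothetical partitions; keys are the dedup criteria, values the original input.
partitions = [
    [("k1", "a"), ("k1", "a-dup"), ("k2", "b")],
    [("k2", "b-dup"), ("k3", "c")],
]

# Step 1: per-partition pass (the work mapPartitions would do on each partition).
locally_unique = [list(dedup_partition(iter(p))) for p in partitions]

# Step 2: merge across partitions (the work reduceByKey would do on the result).
merged = {}
for key, value in chain.from_iterable(locally_unique):
    merged[key] = keep_first(merged[key], value) if key in merged else value

print(sorted(merged.items()))
```

In PySpark, assuming `pairs` is an RDD of (key, value) tuples, the same two functions should fit `pairs.mapPartitions(dedup_partition).reduceByKey(keep_first)`. Which duplicate survives then depends on how the records are partitioned and ordered.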

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <> wrote:
> I need to do deduplication processing in Spark. The current plan is to 
> generate a tuple where key is the dedup criteria and value is the 
> original input. I am thinking to use reduceByKey to discard duplicate 
> values. If I do that, can I simply return the first argument, or should 
> I return a copy of the first argument? Is there a better way to do dedup in Spark?
> -Yao