spark-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Dedup
Date Thu, 09 Oct 2014 05:00:02 GMT
What is your data like? Are you looking at exact matching or are you
interested in nearly same records? Do you need to merge similar records to
get a canonical value?

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <pompermaier@okkam.it>
wrote:

> Maybe you could implement something like this (I don't know if something
> similar already exists in Spark):
>
> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>
> Best,
> Flavio
> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.chammas@gmail.com>
> wrote:
>
>> Multiple values may differ yet still be considered duplicates, depending
>> on how the dedup criterion is chosen. Is that correct? In that case, do
>> you care which value you select for a given key?
>>
>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <yge@ford.com> wrote:
>>
>>>  I need to do deduplication processing in Spark. The current plan is to
>>> generate a tuple where the key is the dedup criterion and the value is the
>>> original input. I am thinking of using reduceByKey to discard duplicate
>>> values. If I do that, can I simply return the first argument, or should I
>>> return a copy of the first argument? Is there a better way to do dedup in
>>> Spark?
>>>
>>>
>>>
>>> -Yao
>>>
>>
>>
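The plan Yao describes maps directly onto `rdd.reduceByKey(lambda a, b: a)` in Spark: key each record by its dedup criterion, then keep the first value per key (returning the first argument is fine; no copy is needed as long as the reducer never mutates it). Below is a minimal plain-Python sketch of that logic, so it runs without a Spark cluster; the records and the `(name, email)` criterion are hypothetical, and `itertools.groupby` over sorted keys stands in for `reduceByKey`:

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical input; the dedup criterion is the (name, email) pair.
records = [
    {"name": "Ann", "email": "ann@x.com", "src": "a"},
    {"name": "Ann", "email": "ann@x.com", "src": "b"},  # duplicate of the first
    {"name": "Bob", "email": "bob@x.com", "src": "c"},
]

# Step 1: build (key, value) tuples, as in Yao's plan.
keyed = [((r["name"], r["email"]), r) for r in records]

# Step 2: the equivalent of reduceByKey(lambda a, b: a) -- for each key,
# keep the first value seen. Returning the first argument unchanged is
# safe because the reducer does not mutate it.
keyed.sort(key=itemgetter(0))
deduped = [
    reduce(lambda a, b: a, (value for _, value in group))
    for _, group in groupby(keyed, key=itemgetter(0))
]

print(len(deduped))  # two distinct keys survive
```

In real Spark the sort/group step is unnecessary; `reduceByKey` shuffles by key itself, and for pure exact-match dedup `rdd.distinct()` (or `dropDuplicates` on a DataFrame) avoids writing the reducer at all.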
