spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Dedup
Date Thu, 09 Oct 2014 06:50:35 GMT
If you are looking to eliminate duplicate rows (or similar), you can
define a key from the data and call reduceByKey on that key.
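
As a rough illustration of the key-then-reduce idea, here is a minimal sketch in plain Python that mimics locally what a map followed by reduceByKey does on an RDD. The names dedup_by_key, key_fn, pick, keep_first and the sample records are made up for illustration and are not from the original question. In PySpark the equivalent would presumably be rdd.map(lambda r: (key_fn(r), r)).reduceByKey(pick).values().

```python
def keep_first(a, b):
    """Reduce function: of two records sharing a key, keep the first one seen."""
    return a


def dedup_by_key(records, key_fn, pick=keep_first):
    """Local stand-in for: map each record to (key, record), then reduce by key.

    key_fn builds the dedup key from a record (hypothetical helper).
    pick decides which of two records with the same key survives.
    """
    kept = {}
    for rec in records:
        k = key_fn(rec)
        # First record for this key is stored as-is; later ones are merged via pick.
        kept[k] = pick(kept[k], rec) if k in kept else rec
    return list(kept.values())


# Hypothetical sample data: (name, email) rows, deduplicated on lower-cased email.
records = [
    ("Ann", "Ann@example.com"),
    ("ANN", "ann@example.com"),
    ("Bob", "bob@example.com"),
]


def dedup_key(rec):
    return rec[1].lower()


unique = dedup_by_key(records, dedup_key)
```

Returning the first argument unchanged, as keep_first does, is enough for this local version. On a real cluster, which of the duplicates survives depends on how the data is partitioned, so this approach only fits when any representative of a group will do.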

Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:

> What is your data like? Are you looking at exact matching or are you
> interested in nearly same records? Do you need to merge similar records to
> get a canonical value?
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>
>> Maybe you could implement something like this (I don't know if something
>> similar already exists in Spark):
>>
>> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>>
>> Best,
>> Flavio
>> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.chammas@gmail.com>
>> wrote:
>>
>>> Multiple values may be different, yet still be considered duplicates
>>> depending on how the dedup criteria are selected. Is that correct? Do you
>>> care in that case what value you select for a given key?
>>>
>>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <yge@ford.com> wrote:
>>>
>>>> I need to do deduplication processing in Spark. The current plan is
>>>> to generate a tuple where the key is the dedup criteria and the value is
>>>> the original input. I am thinking of using reduceByKey to discard
>>>> duplicate values. If I do that, can I simply return the first argument,
>>>> or should I return a copy of the first argument? Is there a better way
>>>> to do dedup in Spark?
>>>>
>>>>
>>>>
>>>> -Yao
>>>>
>>>
>>>
>
