spark-user mailing list archives

From Himanish Kushary <himan...@gmail.com>
Subject Re:
Date Wed, 25 Mar 2015 19:11:32 GMT
It will only give (A,B). I am generating the pairs from combinations of
the strings A, B, C and D, so the pairs (ignoring order) would be

(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)

On successful filtering using the original condition it will transform to
(A,B) and (C,D)
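
To make the rule concrete, here is a minimal single-machine sketch in plain
Python, not Spark. It assumes the pairs are processed sequentially in a fixed
order, and the name greedy_unique_pairs is hypothetical. It only illustrates
the "skip a pair if either string was already used" condition; it does not
address how to distribute the work across partitions.

```python
from itertools import combinations


def greedy_unique_pairs(pairs):
    """Keep a pair only if neither of its strings appears in an earlier kept pair.

    Hypothetical helper: the result depends on the order of `pairs`.
    """
    used = set()   # strings that already appear in a kept pair
    kept = []
    for left, right in pairs:
        if left not in used and right not in used:
            kept.append((left, right))
            used.update((left, right))
    return kept


# All unordered pairs of A, B, C, D, in the order listed above.
all_pairs = list(combinations(["A", "B", "C", "D"], 2))
print(greedy_unique_pairs(all_pairs))

# The three-pair dataset from the question quoted below.
print(greedy_unique_pairs([("A", "B"), ("A", "C"), ("B", "D")]))
```

Because the filter is greedy, feeding the same pairs in a different order can
keep a different set of pairs.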

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:

> What would it do with the following dataset?
>
> (A, B)
> (A, C)
> (B, D)
>
>
> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary <himanish@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have an RDD of pairs of strings like below:
>>
>> (A,B)
>> (B,C)
>> (C,D)
>> (A,D)
>> (E,F)
>> (B,F)
>>
>> I need to transform/filter this into an RDD of pairs that does not repeat
>> a string once it has been used. So something like:
>>
>> (A,B)
>> (C,D)
>> (E,F)
>>
>> (B,C) is out because B has already been used in (A,B), (A,D) is out
>> because A (and D) has been used, etc.
>>
>> I was thinking of using a shared variable to keep track of what has
>> already been used, but that may only work for a single partition
>> and would not scale to a larger dataset.
>>
>> Is there any other efficient way to accomplish this?
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish
