spark-user mailing list archives

From Nathan Kronenfeld <nkronenf...@uncharted.software>
Subject Re:
Date Wed, 25 Mar 2015 19:09:00 GMT
What would it do with the following dataset?

(A, B)
(A, C)
(B, D)
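
To make the ambiguity concrete, here is a minimal single-process sketch in plain Python (not Spark). The function name `greedy_unique_pairs` is hypothetical, and the rule it applies, "keep a pair only if neither string appears in an already kept pair", is my reading of the request quoted below. The result depends on the order in which the pairs are visited, so a distributed version would first need to fix that order.

```python
def greedy_unique_pairs(pairs):
    """Keep a pair only if neither of its strings is in an already kept pair.

    Assumption: pairs are visited strictly in the order given; strings from
    rejected pairs are NOT marked as used.
    """
    used = set()   # strings that appear in some kept pair
    kept = []
    for left, right in pairs:
        if left not in used and right not in used:
            kept.append((left, right))
            used.update((left, right))
    return kept


# Dataset from the original question, in the order it was posted.
original = [("A", "B"), ("B", "C"), ("C", "D"),
            ("A", "D"), ("E", "F"), ("B", "F")]
print(greedy_unique_pairs(original))

# The three-pair dataset above, visited in posted order and in reverse.
ambiguous = [("A", "B"), ("A", "C"), ("B", "D")]
print(greedy_unique_pairs(ambiguous))
print(greedy_unique_pairs(ambiguous[::-1]))
```

Visited in posted order, the three-pair dataset keeps only (A, B); visited in reverse, it keeps (B, D) and (A, C). Both outputs satisfy the "no repeated string" rule, so the rule alone does not pin down a single answer.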

On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary <himanish@gmail.com>
wrote:

> Hi,
>
> I have an RDD of pairs of strings like the one below:
>
> (A,B)
> (B,C)
> (C,D)
> (A,D)
> (E,F)
> (B,F)
>
> I need to transform/filter this into an RDD of pairs that does not repeat a
> string once it has been used. So something like:
>
> (A,B)
> (C,D)
> (E,F)
>
> (B,C) is out because B has already been used in (A,B), (A,D) is out because
> A (and D) has been used, etc.
>
> I was thinking of using a shared variable to keep track of
> what has already been used, but that may only work for a single partition
> and would not scale to a larger dataset.
>
> Is there any other efficient way to accomplish this?
>
> --
> Thanks & Regards
> Himanish
>
