It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be:

(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)

After filtering with the original condition, these would reduce to (A,B) and (C,D).
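
To make the two cases above concrete, here is a minimal single-machine sketch in plain Python (not Spark code). The function name first_come_filter is hypothetical, and the sketch assumes pairs are processed in the order listed, which an RDD does not guarantee by itself.

```python
from itertools import combinations

def first_come_filter(pairs):
    """Keep a pair only if neither string appears in an already kept pair."""
    used = set()
    kept = []
    for left, right in pairs:
        if left not in used and right not in used:
            kept.append((left, right))
            used.update((left, right))
    return kept

# All unordered pairs of A, B, C, D, in the order itertools produces them.
all_pairs = list(combinations(["A", "B", "C", "D"], 2))
from_combinations = first_come_filter(all_pairs)

# The three-pair dataset from the question quoted below.
from_small_dataset = first_come_filter([("A", "B"), ("A", "C"), ("B", "D")])

print(from_combinations)
print(from_small_dataset)
```

With a different input order the surviving pairs can differ, so the ordering assumption matters.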

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:
What would it do with the following dataset?

(A, B)
(A, C)
(B, D)


On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary <himanish@gmail.com> wrote:
Hi,

I have an RDD of pairs of strings like the one below:

(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)

I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used. Something like:

(A,B)
(C,D)
(E,F)

(B,C) is out because B has already been used in (A,B); (A,D) is out because A (and D) has already been used; and so on.
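
The rule described above can be sketched sequentially in plain Python. This is only an illustration of the intended result, not a distributed solution; the name filter_unique_pairs is hypothetical, and the sketch assumes the pairs are visited in the order shown.

```python
def filter_unique_pairs(pairs):
    """Walk the pairs in order and drop any pair that reuses a string."""
    used = set()   # strings that already appear in a kept pair
    kept = []
    for left, right in pairs:
        if left in used or right in used:
            continue  # one of the strings was consumed by an earlier pair
        used.add(left)
        used.add(right)
        kept.append((left, right))
    return kept

pairs = [("A", "B"), ("B", "C"), ("C", "D"),
         ("A", "D"), ("E", "F"), ("B", "F")]
result = filter_unique_pairs(pairs)
print(result)
```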

I was thinking of using a shared variable to keep track of what has already been used, but that may only work for a single partition and would not scale to a larger dataset.
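
The partition concern can be illustrated without Spark by simulating two partitions in plain Python, each keeping its own "used" set. The split into two lists is a made-up example; how Spark would actually partition the data is an assumption here.

```python
from collections import Counter

def filter_partition(pairs):
    """Apply the no-reuse rule using state local to one partition."""
    used = set()
    kept = []
    for left, right in pairs:
        if left not in used and right not in used:
            kept.append((left, right))
            used.update((left, right))
    return kept

# Hypothetical split of the six pairs across two partitions.
partitions = [
    [("A", "B"), ("B", "C"), ("C", "D")],
    [("A", "D"), ("E", "F"), ("B", "F")],
]

# Each partition is filtered independently, then the results are combined.
combined = [pair for part in partitions for pair in filter_partition(part)]

# Count how often each string appears in the combined output.
counts = Counter(s for pair in combined for s in pair)
repeated = sorted(s for s, n in counts.items() if n > 1)

print(combined)
print(repeated)
```

In this simulated split, (A,D) survives in the second partition because that partition never saw (A,B) or (C,D), so A and D each end up used twice.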

Is there any other efficient way to accomplish this?

--
Thanks & Regards
Himanish




--
Thanks & Regards
Himanish