What would it do with the following dataset?

(A, B)
(A, C)
(B, D)

On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary wrote:
> Hi,
>
> I have an RDD of pairs of strings like below:
>
> (A,B)
> (B,C)
> (C,D)
> (A,D)
> (E,F)
> (B,F)
>
> I need to transform/filter this into an RDD of pairs that does not repeat a
> string once it has been used. So something like:
>
> (A,B)
> (C,D)
> (E,F)
>
> (B,C) is out because B has already been used in (A,B), (A,D) is out because
> A (and D) has been used, etc.
>
> I was thinking of using a shared variable to keep track of what has
> already been used, but that may only work for a single partition and
> would not scale to a larger dataset.
>
> Is there any other efficient way to accomplish this?
>
> --
> Thanks & Regards
> Himanish
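
For reference, the rule described in the quoted message can be sketched as a sequential, order-dependent filter in plain Python. This is only a single-machine illustration, assuming the pairs are processed in the order listed; the function name greedy_unique_pairs is hypothetical, and this is not a distributed Spark solution.

```python
def greedy_unique_pairs(pairs):
    """Keep a pair only if neither of its strings has been kept before.

    Hypothetical helper: processes pairs strictly in input order, so the
    result depends on that order.
    """
    used = set()   # strings that already appear in a kept pair
    kept = []
    for left, right in pairs:
        if left in used or right in used:
            continue  # one side was already used, drop the pair
        kept.append((left, right))
        used.update((left, right))
    return kept


original = [("A", "B"), ("B", "C"), ("C", "D"),
            ("A", "D"), ("E", "F"), ("B", "F")]
print(greedy_unique_pairs(original))
# [('A', 'B'), ('C', 'D'), ('E', 'F')]

follow_up = [("A", "B"), ("A", "C"), ("B", "D")]
print(greedy_unique_pairs(follow_up))
# [('A', 'B')]
```

On the three-pair dataset this rule keeps only (A, B), even though (A, C) and (B, D) could both be kept if (A, B) were skipped, so the outcome depends on processing order, which a partitioned RDD does not guarantee by default.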