spark-user mailing list archives

From Himanish Kushary <himan...@gmail.com>
Subject [No Subject]
Date Wed, 25 Mar 2015 17:02:02 GMT
Hi,

I have an RDD of pairs of strings like the one below:

(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)

I need to transform/filter this into an RDD of pairs in which no string
appears again once it has been used. So something like:

(A,B)
(C,D)
(E,F)

(B,C) is out because B has already been used in (A,B); (A,D) is out because
A (and D) have already been used, and so on.
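
To make the rule concrete, here is a minimal sequential sketch in plain Python (not Spark). The function name `greedy_unique_pairs` is made up for illustration, and the sketch assumes the pairs are visited in the order listed above, so the result depends on that order:

```python
def greedy_unique_pairs(pairs):
    """Keep a pair only if neither of its strings occurs in an earlier kept pair."""
    used = set()   # strings that already appear in a kept pair
    kept = []
    for left, right in pairs:
        if left in used or right in used:
            continue  # at least one string was already used, so drop the pair
        used.update((left, right))
        kept.append((left, right))
    return kept


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F")]
print(greedy_unique_pairs(pairs))  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```

Only kept pairs mark their strings as used, which is why (C,D) survives even though C also appears in the rejected (B,C).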

I was thinking of using a shared variable to keep track of what
has already been used, but that may only work within a single partition and
would not scale to larger datasets.
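
The partition concern can be illustrated with a small plain-Python simulation (no Spark involved). The split into two partitions and the helper name `filter_partition` are assumptions made up for this example; each partition keeps its own private "used" set, much as each task would work on its own copy of a shared variable:

```python
def filter_partition(pairs):
    """Greedy filter with a 'used' set that is local to one partition."""
    used = set()
    kept = []
    for left, right in pairs:
        if left not in used and right not in used:
            used.update((left, right))
            kept.append((left, right))
    return kept


# Hypothetical split of the six example pairs into two partitions.
partition_1 = [("A", "B"), ("B", "C"), ("C", "D")]
partition_2 = [("A", "D"), ("E", "F"), ("B", "F")]

combined = filter_partition(partition_1) + filter_partition(partition_2)
print(combined)  # [('A', 'B'), ('C', 'D'), ('A', 'D'), ('E', 'F')]
```

Under this split, (A,D) survives in the second partition because that partition never saw (A,B) or (C,D), so A and D each end up used twice in the combined result.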

Is there any other efficient way to accomplish this?

-- 
Thanks & Regards
Himanish
