spark-user mailing list archives

From Roch Denis <rde...@exostatic.com>
Subject Re: Help in merging a RDD agaisnt itself using the V of a (K,V).
Date Thu, 24 Jul 2014 02:29:36 GMT
For what it's worth, I got it to work with a Cartesian product. It's very
inefficient, but it worked out alright for me. The trick was to flatMap the
result (substep 4) after the Cartesian product so that I could do a
reduceByKey and find the commonalities. Once that was done, I checked whether
any value set still shared a value with any other value set. If so, I ran the
whole process another time.
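
A minimal sketch of that re-run check, in plain Python rather than Spark. The
names has_overlap, repeat_until_disjoint and one_pass are hypothetical, and
the post does not say how the check was actually implemented; this only
illustrates the stopping condition described above.

```python
from itertools import combinations


def has_overlap(groups):
    """Return True if any two different groups share at least one value.

    Each group is a (key_set, value_set) pair of frozensets.
    """
    return any(a[1] & b[1] for a, b in combinations(list(groups), 2))


def repeat_until_disjoint(groups, one_pass):
    """Apply one_pass (a caller-supplied merge function, assumed here)
    until no two groups share a value."""
    groups = list(groups)
    while has_overlap(groups):
        groups = list(one_pass(groups))
    return groups
```

The loop only terminates if one_pass makes progress on every call, i.e. it
merges at least one overlapping pair each time.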

The process is something like this:

SUBSTEP 1: CARTESIAN + FILTER( non inclusive set : False )
        SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['A']), frozenset([1, 2])))
        SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['B']), frozenset([2, 3])))
        SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['S']), frozenset([1, 2, 100])))
        SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['A']), frozenset([1, 2])))
        SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['B']), frozenset([2, 3])))
        SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['C']), frozenset([3, 4])))
        SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['S']), frozenset([1, 2, 100])))
        SET: ((frozenset(['C']), frozenset([3, 4])), (frozenset(['B']), frozenset([2, 3])))
        SET: ((frozenset(['C']), frozenset([3, 4])), (frozenset(['C']), frozenset([3, 4])))
        SET: ((frozenset(['G']), frozenset([10, 20])), (frozenset(['G']), frozenset([10, 20])))
        SET: ((frozenset(['G']), frozenset([10, 20])), (frozenset(['Z']), frozenset([1000, 20])))
        SET: ((frozenset(['Z']), frozenset([1000, 20])), (frozenset(['G']), frozenset([10, 20])))
        SET: ((frozenset(['Z']), frozenset([1000, 20])), (frozenset(['Z']), frozenset([1000, 20])))
        SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['A']), frozenset([1, 2])))
        SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['B']), frozenset([2, 3])))
        SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['S']), frozenset([1, 2, 100])))
SUBSTEP 2 : MERGE
        SET: (frozenset(['A']), frozenset([1, 2]))
        SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
        SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
        SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
        SET: (frozenset(['B']), frozenset([2, 3]))
        SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
        SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
        SET: (frozenset(['C']), frozenset([3, 4]))
        SET: (frozenset(['G']), frozenset([10, 20]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
        SET: (frozenset(['Z']), frozenset([1000, 20]))
        SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
        SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['S']), frozenset([1, 2, 100]))
SUBSTEP 3 : DISTINCT
        SET: (frozenset(['A']), frozenset([1, 2]))
        SET: (frozenset(['C']), frozenset([3, 4]))
        SET: (frozenset(['S']), frozenset([1, 2, 100]))
        SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
        SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
        SET: (frozenset(['B']), frozenset([2, 3]))
        SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['G']), frozenset([10, 20]))
        SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
        SET: (frozenset(['Z']), frozenset([1000, 20]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SUBSTEP 4: flatmap
        SET: ('A', (frozenset(['A']), frozenset([1, 2])))
        SET: ('C', (frozenset(['C']), frozenset([3, 4])))
        SET: ('S', (frozenset(['S']), frozenset([1, 2, 100])))
        SET: ('A', (frozenset(['A', 'S']), frozenset([1, 2, 100])))
        SET: ('S', (frozenset(['A', 'S']), frozenset([1, 2, 100])))
        SET: ('A', (frozenset(['A', 'B']), frozenset([1, 2, 3])))
        SET: ('B', (frozenset(['A', 'B']), frozenset([1, 2, 3])))
        SET: ('B', (frozenset(['B']), frozenset([2, 3])))
        SET: ('S', (frozenset(['S', 'B']), frozenset([1, 2, 3, 100])))
        SET: ('B', (frozenset(['S', 'B']), frozenset([1, 2, 3, 100])))
        SET: ('G', (frozenset(['G']), frozenset([10, 20])))
        SET: ('C', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
        SET: ('B', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
        SET: ('Z', (frozenset(['Z']), frozenset([1000, 20])))
        SET: ('Z', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
        SET: ('G', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
SUBSTEP 5: reduceByKey
        SET: ('A', (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100])))
        SET: ('C', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
        SET: ('B', (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4])))
        SET: ('G', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
        SET: ('S', (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100])))
        SET: ('Z', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
SUBSTEP 6: map
        SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
        SET: (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
        SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SUBSTEP 7: distinct
        SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
        SET: (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4]))
        SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
        SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
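
The seven substeps above can be sketched in plain Python, with ordinary lists
and sets standing in for RDDs. This is not the code from the post: the
function name merge_pass is hypothetical, and the input data is inferred from
the SUBSTEP 1 output. The filter is assumed to keep pairs whose value sets
intersect, which matches the pairs listed there.

```python
from itertools import product


def merge_pass(groups):
    """One pass over a collection of (key_set, value_set) frozenset pairs."""
    # Substep 1: Cartesian product, keeping pairs whose value sets intersect.
    pairs = [(a, b) for a, b in product(groups, repeat=2) if a[1] & b[1]]
    # Substeps 2-3: union each pair's keys and values, then deduplicate.
    merged = {(a[0] | b[0], a[1] | b[1]) for a, b in pairs}
    # Substep 4: emit one (key, group) record per individual key in a group.
    keyed = [(k, g) for g in merged for k in g[0]]
    # Substep 5: per key, union every group that mentions that key.
    reduced = {}
    for k, (ks, vs) in keyed:
        old_ks, old_vs = reduced.get(k, (frozenset(), frozenset()))
        reduced[k] = (old_ks | ks, old_vs | vs)
    # Substeps 6-7: drop the key and deduplicate the remaining groups.
    return set(reduced.values())


fs = frozenset
start = [
    (fs(['A']), fs([1, 2])),
    (fs(['B']), fs([2, 3])),
    (fs(['C']), fs([3, 4])),
    (fs(['G']), fs([10, 20])),
    (fs(['Z']), fs([1000, 20])),
    (fs(['S']), fs([1, 2, 100])),
]
first = merge_pass(start)    # the four groups listed under SUBSTEP 7
second = merge_pass(first)   # a second pass, since groups still overlap
```

On this data the first pass leaves groups that still share values (for
example, ['C', 'B'] and ['A', 'S', 'B'] both contain 2 and 3), which is why a
second pass is needed. In Spark the same stages would presumably map onto the
RDD operations named in the substep headings (cartesian, filter, map,
distinct, flatMap, reduceByKey).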




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530p10560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
