spark-dev mailing list archives

From khayyatzy <>
Subject [GitHub] incubator-spark pull request: Adding RDD unique self cross product
Date Thu, 13 Feb 2014 08:53:57 GMT
Github user khayyatzy commented on the pull request:
    I am using rdd.selfCartesian for optimization purposes. I am using Spark for a large data
analytics project on relational data. My application sometimes needs to compare a table
with itself, looking for inconsistencies within the data regardless of the order of the compared
tuples.
    One advantage of rdd.selfCartesian is that it generates roughly half the results of
rdd.cartesian(rdd). For example, for a table with 100 rows, rdd.cartesian(rdd) generates
10000 tuples to compare, while rdd.selfCartesian generates only 5050 (100 * 101 / 2).
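    The counts above can be checked with a small sketch in plain Python. This does not use Spark or the proposed rdd.selfCartesian. It assumes the operation yields every unordered pair once, including each row paired with itself, which is what the figure of 5050 implies. The name rows is a hypothetical stand-in for the table.

```python
from itertools import combinations_with_replacement, product

# Hypothetical stand-in for a 100-row table.
rows = list(range(100))

# Full cross product: every ordered pair (tx, ty).
full = list(product(rows, rows))

# Unique self cross product (assumed semantics): each unordered pair once,
# including the pair of a row with itself.
unique = list(combinations_with_replacement(rows, 2))

print(len(full))    # 10000
print(len(unique))  # 5050
```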
    Another advantage is that rdd.selfCartesian lets me avoid duplicate errors
when searching for tuple inconsistencies. In my application, if an error is found for
tuples in the order (tx,ty), the same error is also found when they are in the opposite
order (ty,tx). If I used rdd.cartesian(rdd), I would have to look for duplicate errors in the
resulting RDD of pairs and remove them.
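    A rough illustration of the duplicate-error point, again in plain Python rather than Spark. The table, its columns, and the inconsistent check are all hypothetical; the only assumption carried over from the message is that the check is symmetric, so it fires for (tx,ty) exactly when it fires for (ty,tx).

```python
from itertools import combinations_with_replacement, product

# Hypothetical table of (id, city, zip_code).
# Rows 1 and 2 share a zip code but disagree on the city.
table = [(1, "Paris", "75001"), (2, "Lyon", "75001"), (3, "Nice", "06000")]

def inconsistent(tx, ty):
    """Symmetric check (an assumption): same zip code, different city."""
    return tx[2] == ty[2] and tx[1] != ty[1]

# Full cross product reports the same error in both orders.
errors_full = [(tx[0], ty[0])
               for tx, ty in product(table, table) if inconsistent(tx, ty)]

# Unordered pairs report each error once, so no deduplication pass is needed.
errors_unique = [(tx[0], ty[0])
                 for tx, ty in combinations_with_replacement(table, 2)
                 if inconsistent(tx, ty)]

print(errors_full)    # [(1, 2), (2, 1)]
print(errors_unique)  # [(1, 2)]
```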
    Zuhair Khayyat
