spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paschalis Veskos <>
Subject Avoid Cartesian product in calculating a distance matrix?
Date Fri, 05 Aug 2016 17:20:09 GMT
Hello everyone,

I am interested in running an application on Spark that at some point
needs to compare all elements of an RDD against all others to create a
distance matrix. The RDD is of type <String, Double> and the Pearson
correlation is applied to each element against all others, generating
a matrix with the distance between all possible combinations of

I have implemented this by taking the cartesian product of the RDD
with itself, filtering half the matrix away since it is symmetric,
then doing a combineByKey to get all other elements that it needs to
be compared with. I map the output of this over the comparison
function implementing the Pearson correlation.

You can probably guess this is dead slow. I use Spark 1.6.2, the code
is written in Java 8. At the rate it is processing in a cluster with 4
nodes with 16cores and 56gb ram each, for a list with ~15000 elements
split in 512 partitions, the cartesian operation alone is estimated to
take about 3000 hours (all cores are maxed out on all machines)!

Is there a way to avoid the cartesian product to calculate what I
want? Would a DataFrame join be faster? Or is this an operation that
just requires a much larger cluster?

Thank you,


To unsubscribe e-mail:

View raw message