spark-user mailing list archives

From Sonal Goyal
Subject Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?
Date Fri, 17 Oct 2014 12:02:59 GMT
Cartesian joins of large datasets are usually going to be slow. If there
is a way you can reduce the problem space to make sure you only join
subsets with each other, that may be helpful. Maybe if you explain your
problem in more detail, people on the list can come up with more
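
One way to "only join subsets with each other" is blocking: give every vector a coarse key and compare only vectors that share a key. Below is a minimal local sketch in plain Python, not Spark code. The `block_key` rule (index of the largest-magnitude component), the function names, and the sample data are all hypothetical, chosen only for illustration. Note that this is an approximation: pairs that fall into different buckets are never compared, so the key has to be chosen so that similar vectors tend to collide.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def block_key(vec):
    """Hypothetical blocking key: index of the largest-magnitude component."""
    return max(range(len(vec)), key=lambda i: abs(vec[i]))


def blocked_similarities(features1, features2, key_fn=block_key, sim_fn=cosine):
    """Compare only pairs whose blocking keys match, instead of all pairs.

    features1 and features2 are lists of (id, vector) tuples.
    Returns a dict mapping (id1, id2) to the similarity of that pair.
    """
    # Group the second set by key so each lookup touches one bucket only.
    buckets = defaultdict(list)
    for id2, v2 in features2:
        buckets[key_fn(v2)].append((id2, v2))

    result = {}
    for id1, v1 in features1:
        for id2, v2 in buckets.get(key_fn(v1), []):
            result[(id1, id2)] = sim_fn(v1, v2)
    return result


# Made-up sample data: 2 x 3 = 6 possible pairs, but only 2 share a key.
features1 = [("a", [1.0, 0.0, 0.1]), ("b", [0.0, 2.0, 0.0])]
features2 = [("x", [3.0, 0.1, 0.0]), ("y", [0.0, 1.0, 0.2]), ("z", [0.1, 0.0, 5.0])]
pairs = blocked_similarities(features1, features2)
```

In Spark, the analogous step would be to key both RDDs by the blocking key and use `join` on the keyed RDDs rather than `cartesian`; whether that helps depends on how evenly the keys spread the data.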

Best Regards,
Nube Technologies


On Fri, Oct 17, 2014 at 4:13 PM, Jaonary Rabarisoa wrote:

> Hi all,
> I need to compute a similarity between the elements of two large sets of
> high-dimensional feature vectors.
> Naively, I create all possible pairs of vectors with
> *features1.cartesian(features2)* and then map the resulting pair RDD
> with my similarity function.
> The problem is that the cartesian operation takes a lot of time, more time
> than computing the similarity itself. If I save each of my feature vectors
> to disk, form a list of file name pairs, and compute the similarity by
> reading the files, it runs significantly faster.
> Any ideas will be helpful,
> Cheers,
> Jao
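
The naive pattern in the quoted message (form every pair, then map a similarity function over the pairs) can be sketched locally as follows. This is plain Python with `itertools.product` standing in for `RDD.cartesian`, not Spark code; the function names, the dot-product similarity, and the sample vectors are assumptions made for illustration. It shows why the pairing step dominates: n vectors against m vectors always yields n * m pairs, whatever the similarity function costs.

```python
from itertools import product


def dot(u, v):
    """Dot product, used here as a placeholder similarity function."""
    return sum(a * b for a, b in zip(u, v))


def all_pairs_similarity(features1, features2, sim_fn=dot):
    """Score every (i, j) pair: the local analogue of cartesian followed by map.

    Returns a list of ((i, j), score) tuples with len(features1) * len(features2) entries.
    """
    return [
        ((i, j), sim_fn(u, v))
        for (i, u), (j, v) in product(enumerate(features1), enumerate(features2))
    ]


# Made-up sample data: 2 vectors against 3 vectors gives 6 scored pairs.
set1 = [[1.0, 0.0], [0.0, 1.0]]
set2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = all_pairs_similarity(set1, set2)
```

With two sets of 100,000 vectors each, the same pattern would produce 10 billion pairs, each carrying two full vectors.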
