spark-user mailing list archives

From Jaonary Rabarisoa <>
Subject Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?
Date Fri, 17 Oct 2014 21:41:40 GMT
Hi Reza,

Thank you for the suggestion. The number of points is not that large, about
1,000 in each set, so I have 1000x1000 pairs. But my similarity is
obtained using metric learning to rank, and from Spark's point of view it is a
black box. So my idea was just to distribute the computation of the
1000x1000 similarities.

After some investigation, I managed to make it run faster. My feature
vectors are obtained from a join operation, and I didn't cache the result
of that join before the cartesian operation. Caching the result of the
join makes my code run amazingly faster. So I think the real problem was
my lack of good Spark programming practices.


On Fri, Oct 17, 2014 at 11:08 PM, Reza Zadeh <> wrote:

> Hi Jaonary,
> What are the numbers, i.e. number of points you're trying to do all-pairs
> on, and the dimension of each?
> Have you tried the new implementation of columnSimilarities in RowMatrix?
> Setting the threshold high enough (potentially above 1.0) might solve your
> problem; here is an example <>.
> This implements the DIMSUM sampling scheme, recently merged into master
> <>.
> Best,
> Reza
> On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa <>
> wrote:
>> Hi all,
>> I need to compute a similarity between elements of two large sets of
>> high-dimensional feature vectors.
>> Naively, I create all possible pairs of vectors with
>> *features1.cartesian(features2)* and then map the resulting paired RDD
>> with my similarity function.
>> The problem is that the cartesian operation takes a lot of time, more
>> than computing the similarity itself. If I save each of my feature vectors
>> to disk, form a list of file-name pairs, and compute the similarity by
>> reading the files, it runs significantly faster.
>> Any ideas will be helpful,
>> Cheers,
>> Jao
