spark-user mailing list archives

From Jaonary Rabarisoa <jaon...@gmail.com>
Subject Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?
Date Fri, 17 Oct 2014 21:41:40 GMT
Hi Reza,

Thank you for the suggestion. The number of points is not that large: about
1000 in each set, so I have 1000x1000 pairs. However, my similarity comes
from a learned metric (metric learning to rank), which Spark sees as a
black box. So my idea was simply to distribute the computation of the
1000x1000 similarities.

After some investigation, I managed to make it run faster. My feature
vectors are obtained from a join operation, and I had not cached the result
of that join before the cartesian operation. Caching the result of the
join makes my code run dramatically faster. So I think the real problem
was my lack of good Spark programming practice.
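
Spark evaluates RDDs lazily, so an uncached RDD can be recomputed from its
lineage each time it is read. In Spark the fix described above amounts to
calling cache() on the joined RDD before calling cartesian. The sketch below
is a toy model in plain Python, not the Spark API: LazyDataset,
expensive_join, similarity and count_right_evaluations are hypothetical names
made up for illustration. It only counts how often the upstream "join" is
re-run during a nested-loop cartesian product, with and without caching.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: recomputed on every traversal
    unless it has been cached."""

    def __init__(self, compute):
        self._compute = compute  # function that rebuilds the data
        self._cached = None
        self.evaluations = 0     # how many times the upstream work ran

    def cache(self):
        # Materialize the data once. Unlike Spark's cache(), which is
        # lazy, this toy version computes eagerly to keep things simple.
        if self._cached is None:
            self._cached = list(self)
        return self

    def __iter__(self):
        if self._cached is not None:
            return iter(self._cached)
        self.evaluations += 1
        return iter(self._compute())


def expensive_join():
    # Stands in for the join that produces the feature vectors.
    return [(i, float(i)) for i in range(100)]


def cartesian(left, right):
    # Nested loop: the right side is traversed once per left element.
    for x in left:
        for y in right:
            yield x, y


def similarity(a, b):
    # Placeholder for the black-box learned similarity.
    return -abs(a[1] - b[1])


def count_right_evaluations(use_cache):
    left = expensive_join()
    right = LazyDataset(expensive_join)
    if use_cache:
        right.cache()
    scores = [similarity(x, y) for x, y in cartesian(left, right)]
    return len(scores), right.evaluations


if __name__ == "__main__":
    print("uncached:", count_right_evaluations(False))
    print("cached:  ", count_right_evaluations(True))
```

Both runs produce the same number of pairs; only the count of upstream
re-evaluations differs. How often a real Spark job recomputes an uncached
parent depends on the partitioning and the operations involved, so treat the
counts here as an illustration of the mechanism, not a measurement.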

Best
Jao

On Fri, Oct 17, 2014 at 11:08 PM, Reza Zadeh <reza@databricks.com> wrote:

> Hi Jaonary,
>
> What are the numbers, i.e. number of points you're trying to do all-pairs
> on, and the dimension of each?
>
> Have you tried the new implementation of columnSimilarities in RowMatrix?
> Setting the threshold high enough (potentially above 1.0) might solve your
> problem, here is an example
> <https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala>
> .
>
> This implements the DIMSUM sampling scheme, recently merged into master
> <https://github.com/apache/spark/pull/1778>.
>
> Best,
> Reza
>
> On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa <jaonary@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I need to compute a similarity between the elements of two large sets of
>> high-dimensional feature vectors.
>> Naively, I create all possible pairs of vectors with
>> *features1.cartesian(features2)* and then map the resulting pair RDD
>> with my similarity function.
>>
>> The problem is that the cartesian operation takes a lot of time, more time
>> than computing the similarity itself. If I save each of my feature vectors
>> to disk, build a list of file name pairs, and compute the similarity by
>> reading the files, it runs significantly faster.
>>
>> Any ideas will be helpful,
>>
>> Cheers,
>>
>> Jao
>>
>>
>>
>>
>
