spark-user mailing list archives

From Reza Zadeh <r...@databricks.com>
Subject Re: DIMSUM and ColumnSimilarity use case ?
Date Wed, 10 Dec 2014 18:52:25 GMT
As Sean mentioned, you would be computing similar features then.
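
To illustrate the point (this sketch is not code from the thread): with one user per row, column similarity compares features with each other, so the result is n by n regardless of how many users there are. The pure-Python version below is a brute-force illustration on hypothetical toy data. It is not the DIMSUM sampling scheme, and the function name `column_cosine_similarities` is made up for this example; in MLlib the corresponding call on a RowMatrix is `columnSimilarities()`.

```python
import math


def column_cosine_similarities(rows):
    """Return an n x n matrix of cosine similarities between columns.

    `rows` is an m x n matrix given as a list of equal-length lists,
    one row per user and one column per feature.
    """
    n = len(rows[0])
    # Accumulate dot products between every pair of columns (the Gram matrix).
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    # The diagonal holds squared column norms.
    norms = [math.sqrt(gram[i][i]) for i in range(n)]
    return [
        [
            gram[i][j] / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy data: 4 users (rows) x 3 features (columns).
users = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 2.0, 1.0],
]
sims = column_cosine_similarities(users)
# sims is 3 x 3 (features x features), not 4 x 4 (users x users).
```

Here the second feature is exactly twice the first, so those two columns come out as maximally similar even though nothing has been said about which users resemble each other.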

If you want to find similar users, I suggest running k-means with some
fixed number of clusters. It's not reasonable to try to compute all-pairs
similarities between 1bn items, so k-means with fixed k is more suitable
here.
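
As a rough sketch of the idea (not code from the thread): with m = 1bn rows, all pairs means m(m-1)/2, about 5 x 10^17 similarities, whereas each k-means pass costs on the order of m * k distance computations. The pure-Python Lloyd's iteration below runs on hypothetical toy data. The helper names and the "first k points" initialisation are simplifications made for this example; on a real dataset you would use a library implementation such as MLlib's KMeans.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm with a fixed number of clusters k.

    Assumption for this sketch: the first k points seed the centers.
    Real implementations use better seeding (e.g. k-means++).
    """
    centers = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


# Hypothetical toy data: 4 users with 2 features each.
users = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2]]
centers, assignments = kmeans(users, k=2)
# Users that share a cluster id are the "similar users".
```

Users 0 and 2 end up in one cluster and users 1 and 3 in the other, without ever comparing every user against every other user.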

Best,
Reza

On Wed, Dec 10, 2014 at 10:39 AM, Sean Owen <sowen@cloudera.com> wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa <jaonary@gmail.com>
> wrote:
> > Dear all,
> >
> > I'm trying to understand the correct use case for ColumnSimilarity as
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarities between the
> > columns of a given matrix. The DIMSUM paper says that it's efficient for
> > large m (rows) and small n (columns). In this case the output will be an
> > n by n matrix.
> >
> > Now, suppose I want to compute the similarity of several users, say m =
> > billions. Each user is described by a high-dimensional feature vector,
> > say n = 10000. In my dataset, one row represents one user. So in that
> > case, computing the column similarities of my matrix is not the same as
> > computing the similarity of all users. Then what does computing the
> > similarity of the columns of my matrix mean in this case?
> >
> > Best regards,
> >
> > Jao
