As Sean mentioned, you would be computing similarities between features (the columns) in that case, not between users.
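
To make the row/column distinction concrete, here is a minimal sketch in plain Python. It is not Spark code and not DIMSUM, just a brute-force cosine similarity over columns. The function name `column_cosine_similarities` and the toy data are made up for illustration. With m users as rows and n features as columns, the result is n by n, one entry per pair of features, however many users there are.

```python
import math

def column_cosine_similarities(rows):
    """Brute-force cosine similarity between every pair of columns.

    rows: list of m equal-length lists (one row per user, one column per feature).
    Returns an n x n list of lists, where n is the number of columns.
    """
    n = len(rows[0])
    # Euclidean norm of each column
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(r[i] * r[j] for r in rows)
            # Treat an all-zero column as having similarity 0 with everything
            if norms[i] > 0 and norms[j] > 0:
                sims[i][j] = dot / (norms[i] * norms[j])
    return sims

# Toy data (hypothetical): 4 users (rows) x 3 features (columns)
users = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 2.0, 1.0],
]
sims = column_cosine_similarities(users)
# sims is 3 x 3: it compares features with features, not users with users
```

Adding more users to `users` changes the values in `sims` but never its shape, which is why this answers a different question from "which users are alike".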
If you want to find similar users, I suggest running k-means with a
fixed number of clusters. Computing all pairwise similarities between
1bn items is not feasible (that is on the order of 10^18 pairs), so
k-means with a fixed k is more suitable here.
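
As a rough illustration of the idea, here is a small plain-Python sketch of k-means (Lloyd's algorithm) over user rows. This is not the MLlib implementation; on a real dataset you would use the KMeans in MLlib on an RDD of vectors. The helper names, the seed, and the toy data are all assumptions for the example. Users that land in the same cluster are the candidates for "similar users", and each iteration costs about m * k distance computations rather than m^2.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with a fixed k and a fixed iteration count.

    Returns (centers, assignments), where assignments[i] is the index of the
    center chosen for points[i] in the last assignment step.
    """
    rng = random.Random(seed)
    # Initialise centers with k distinct input points (simple random init)
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

# Toy data (hypothetical): 6 users with 2 features, in two obvious groups
users = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
centers, assignments = kmeans(users, k=2)
# Users sharing a value in `assignments` are treated as similar
```

If you then need actual pairwise similarities, you can compute them only within each cluster, which keeps the number of pairs manageable.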
Best,
Reza
On Wed, Dec 10, 2014 at 10:39 AM, Sean Owen <sowen@cloudera.com> wrote:
> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
> > Dear all,
> >
> > I'm trying to understand the correct use case for ColumnSimilarity,
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarities between the
> > columns of a given matrix. The DIMSUM paper says that it's efficient for
> > large m (rows) and small n (columns). In this case the output will be an
> > n by n matrix.
> >
> > Now, suppose I want to compute the similarity of several users, say m =
> > billions. Each user is described by a high-dimensional feature vector,
> > say n = 10000. In my dataset, one row represents one user. So in that
> > case computing the similarity of my matrix's columns is not the same as
> > computing the similarity of all users. Then, what does computing the
> > similarity of the columns of my matrix mean in this case?
> >
> > Best regards,
> >
> > Jao
