spark-user mailing list archives

From Debasish Das <debasish.da...@gmail.com>
Subject Re: DIMSUM and ColumnSimilarity use case ?
Date Wed, 10 Dec 2014 17:22:04 GMT
If you have a tall, skinny matrix of m users and n products, column
similarity gives you an n x n matrix (product x product). This is also
called the product correlation matrix, and it can hold cosine, Pearson, or
other kinds of similarity. Note that if an entry is unobserved (user
Jaonary did not rate the movie Top Gun), column similarities treat it
as an implicit 0.
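
To make the n x n output and the implicit-zero behavior concrete, here is a minimal pure-Python sketch of brute-force column cosine similarity. It is not the DIMSUM sampling scheme and not the Spark API; the function name, the dict-per-row layout, and the toy ratings are assumptions made up for illustration.

```python
import math

def column_cosine_similarities(rows, n_cols):
    """rows: list of {column_index: value} dicts, one per user.
    Entries missing from a dict are treated as implicit 0."""
    # Accumulate column-by-column dot products in one pass over the rows.
    dots = [[0.0] * n_cols for _ in range(n_cols)]
    for row in rows:
        items = list(row.items())
        for i, vi in items:
            for j, vj in items:
                dots[i][j] += vi * vj
    # The diagonal holds squared column norms.
    norms = [math.sqrt(dots[j][j]) for j in range(n_cols)]
    sims = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(n_cols):
            if norms[i] > 0 and norms[j] > 0:
                sims[i][j] = dots[i][j] / (norms[i] * norms[j])
    return sims

# Hypothetical toy data: 3 users (rows) x 3 products (columns).
ratings = [
    {0: 5.0, 1: 4.0},  # this user never rated product 2 -> implicit 0
    {0: 4.0, 2: 1.0},
    {1: 5.0, 2: 2.0},
]
sims = column_cosine_similarities(ratings, 3)  # 3 x 3 product-by-product matrix
```

The result has one row and one column per product, regardless of how many users there are, which is why the approach suits large m and small n.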

If you want similar users, you need to generate an m x m matrix, which
takes you toward a kernel matrix. The general problem is to take an m x n
matrix with n features per row and expand it to m features per row, where
m > n. Use cosine for a linear kernel and RBF for a non-linear kernel.
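
As a rough illustration of the m x m kernel matrix, the sketch below builds it by brute force for a handful of users. The function names, the gamma value, and the toy vectors are assumptions for illustration; cosine is used here as a normalized linear kernel. At m in the billions this O(m*m) loop is not feasible, which is the point of the discussion.

```python
import math

def cosine_kernel(x, y):
    # Normalized linear kernel: dot product divided by the two norms.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def rbf_kernel(x, y, gamma=0.5):
    # RBF kernel: exp(-gamma * squared Euclidean distance).
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_matrix(users, kernel):
    # Fill the symmetric m x m matrix, computing each pair once.
    m = len(users)
    K = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            K[i][j] = K[j][i] = kernel(users[i], users[j])
    return K

# Hypothetical toy data: m = 3 users, n = 2 features each.
users = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K_cos = kernel_matrix(users, cosine_kernel)
K_rbf = kernel_matrix(users, rbf_kernel)
```

Row i of either matrix can be read as a new m-dimensional feature vector for user i, which is the "n features to m features" expansion described above.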

The DIMSUM/column similarity map-reduce is not optimized for kernel matrix
generation, so you need to look into map-reduce kernel matrix generation.
The kernel matrix can then help you with similar users, spectral
clustering, and kernel regression/classification/SVM if you have labels.

A simplification is to take your m x n matrix and run k-means on it, which
produces clusters of users. Then, for each user, you can compute the
closest users within its cluster. That drops the complexity from O(m*m) to
O(m*c), where c is the maximum number of users in any cluster.
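
The k-means shortcut can be sketched as follows. This is a minimal single-machine illustration, not MLlib's KMeans; the helper names, the naive "first k points" initialization, and the toy data are assumptions. The search for a nearest user only scans members of the same cluster, which is where the O(m*c) cost comes from.

```python
def sq_dist(x, y):
    # Squared Euclidean distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, iterations=10):
    # Assumption for illustration: use the first k points as initial centroids.
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def nearest_in_cluster(points, assign, i):
    # Compare user i only against users in the same cluster.
    candidates = [j for j, a in enumerate(assign) if a == assign[i] and j != i]
    if not candidates:
        return None
    return min(candidates, key=lambda j: sq_dist(points[i], points[j]))

# Hypothetical toy data: two well-separated groups of users.
users = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
         [5.0, 5.0], [5.1, 5.0], [5.0, 5.3]]
assign = kmeans(users, k=2)
closest_to_first = nearest_in_cluster(users, assign, 0)
```

One caveat with this shortcut: a user's true nearest neighbor may sit just across a cluster boundary, so the result is approximate.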


On Wed, Dec 10, 2014 at 7:39 AM, Sean Owen <sowen@cloudera.com> wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa <jaonary@gmail.com>
> wrote:
> > Dear all,
> >
> > I'm trying to understand what the correct use case is for ColumnSimilarity,
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarities between the
> > columns of a given matrix. The DIMSUM paper says that it's efficient for
> > large m (rows) and small n (columns). In this case the output will be an
> > n by n matrix.
> >
> > Now, suppose I want to compute the similarity of several users, say m =
> > billions. Each user is described by a high-dimensional feature vector,
> > say n = 10000. In my dataset, one row represents one user. So in that
> > case computing the column similarity of my matrix is not the same as
> > computing the similarity of all users. Then what does it mean to compute
> > the similarity of the columns of my matrix in this case?
> >
> > Best regards,
> >
> > Jao
