spark-user mailing list archives

From Debasish Das <>
Subject Re: DIMSUM and ColumnSimilarity use case ?
Date Wed, 10 Dec 2014 17:22:04 GMT
If you have a tall and skinny matrix of m users and n products, column
similarity will give you an n x n matrix (product x product)...this is
also called product correlation, and it can be cosine, Pearson or another
kind of correlation...Note that if an entry is unobserved (user Jaonary
did not rate the movie Top Gun), column similarities will treat it as an
implicit 0...
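
To make the implicit 0 point concrete, here is a minimal plain-Python
sketch (not Spark code, and not how DIMSUM samples internally) that
computes exact cosine similarities between the columns of a small ratings
matrix. The function name, the use of None for an unobserved rating and
the toy data are all assumptions made for illustration.

```python
import math


def column_cosine_similarities(rows):
    """Return the n x n cosine similarity matrix of the columns of `rows`.

    `rows` is a list of m equal-length lists. None marks an unobserved
    entry (an assumption of this sketch) and is treated as an implicit 0.
    """
    n = len(rows[0])
    # Replace unobserved entries with 0.0.
    dense = [[0.0 if v is None else float(v) for v in row] for row in rows]
    # Euclidean norm of every column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in dense)) for j in range(n)]
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(row[i] * row[j] for row in dense)
            denom = norms[i] * norms[j]
            # An all-zero column has no direction; report similarity 0.
            sims[i][j] = dot / denom if denom > 0 else 0.0
    return sims


# 4 users (rows) x 3 products (columns); None = user did not rate it.
ratings = [
    [5, None, 1],
    [4, 2, None],
    [None, 5, 4],
    [1, 4, 5],
]
sims = column_cosine_similarities(ratings)  # 3 x 3, product x product
```

The result has one row and one column per product, regardless of how many
users there are, which is why the approach suits large m and small n.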

If you want similar users, you want to generate an m x m matrix, and you
are heading towards a kernel matrix...The general problem is to take an
m x n matrix that has n features and expand it to m features, where
m > n...cosine for a linear kernel and RBF for a non-linear kernel...
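
As a rough sketch of what the m x m kernel matrix looks like, the
plain-Python snippet below builds it from the rows (users) with either a
cosine kernel (a normalized linear kernel) or an RBF kernel. The function
names, the gamma value and the toy data are assumptions for illustration.
The double loop is O(m*m), which is exactly the cost that makes this hard
at scale.

```python
import math


def cosine_kernel(x, y):
    """Cosine similarity between two row vectors (normalized linear kernel)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma=0.5 is an arbitrary choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(rows, kernel):
    """Return the m x m matrix of kernel values between all pairs of rows."""
    m = len(rows)
    return [[kernel(rows[i], rows[j]) for j in range(m)] for i in range(m)]


# 3 users (rows) described by 2 features each.
users = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
K_cos = kernel_matrix(users, cosine_kernel)  # 3 x 3, user x user
K_rbf = kernel_matrix(users, rbf_kernel)     # 3 x 3, user x user
```

Users 0 and 1 point in the same direction, so their cosine value is 1
even though their RBF value is below 1; the two kernels answer different
questions about "similar".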

The dimsum/column similarity map-reduce is not optimized for a kernel
matrix; you need to look into map-reduce kernel matrix
generation...this kernel matrix can then help you with similar users,
spectral clustering and kernel regression/classification/SVM if you have

A simplification of the problem is to take your m x n matrix and run
k-Means on it, which will produce clusters of users...for each user you
can then compute the closest users in its cluster...that drops the
complexity from O(m*m) to O(m*c), where c is the max number of users in
each cluster...
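
A minimal plain-Python sketch of that idea follows: cluster the users
with a naive k-means, then search for a user's nearest neighbour only
inside its own cluster. Everything here (function names, the evenly
spaced initialization, the fixed iteration count, the toy data) is an
assumption for illustration; on real data you would use a distributed
k-means implementation instead.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, iterations=20):
    """Naive Lloyd's k-means; returns the cluster index of every point."""
    # Deterministic toy initialization: k points spaced evenly in the input.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def nearest_in_cluster(points, assignment, user):
    """Index of the closest other user in `user`'s cluster, or None."""
    candidates = [
        i for i, a in enumerate(assignment)
        if a == assignment[user] and i != user
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda i: sq_dist(points[user], points[i]))


# 6 users (rows) with 2 features each, forming two obvious groups.
users = [
    [1.0, 1.0], [1.1, 1.0], [0.5, 1.5],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
]
assignment = kmeans(users, k=2)
neighbor = nearest_in_cluster(users, assignment, 0)
```

Each lookup compares a user against at most c other users instead of all
m, at the price of missing neighbours that land in a different cluster.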

On Wed, Dec 10, 2014 at 7:39 AM, Sean Owen <> wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa <>
> wrote:
> > Dear all,
> >
> > I'm trying to understand what is the correct use case of ColumnSimilarity
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarity of a column of a
> > given matrix. The DIMSUM paper says that it's efficient for large m (rows)
> > and small n (columns). In this case the output will be a n by n matrix.
> >
> > Now, suppose I want to compute similarity of several users, say m =
> > billions. Each user is described by a high dimensional feature vector, say
> > n = 10000. In my dataset, one row represents one user. So in that case
> > computing the similarity of my matrix is not the same as computing the
> > similarity of all users. Then, what does it mean to compute the similarity
> > of the columns of my matrix in this case?
> >
> > Best regards,
> >
> > Jao
