Computing row similarities is not cheap when there are many rows, since it
amounts to computing the Gram matrix AA^T of a matrix A. For n rows that
matrix has n^2 entries, so the cost grows quadratically with the row count.
There is a JIRA to track handling use cases (1) and (2) below more
efficiently than computing all pairs:
https://issues.apache.org/jira/browse/SPARK-3066
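
For reference, the all-pairs baseline mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the brute-force approach, not MLlib code and not the method proposed in the JIRA; the function names `row_cosine_similarities` and `top_k_similar` are made up for this example, and cosine similarity is assumed as the similarity measure.

```python
import heapq
import math


def row_cosine_similarities(rows):
    """Brute-force cosine similarity between every pair of rows.

    Equivalent to normalizing each row of A and forming A A^T, so it
    performs on the order of n^2 dot products for n rows.
    """
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    n = len(rows)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Rows with zero norm have no defined cosine; leave them at 0.0.
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            sims[i][j] = sims[j][i] = dot / (norms[i] * norms[j])
    return sims


def top_k_similar(sims, k):
    """For each row, return the indices of its k most similar other rows."""
    result = []
    for i, row in enumerate(sims):
        candidates = ((s, j) for j, s in enumerate(row) if j != i)
        result.append([j for _, j in heapq.nlargest(k, candidates)])
    return result


if __name__ == "__main__":
    # Hypothetical user-factor rows, e.g. from a matrix factorization model.
    user_factors = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
    sims = row_cosine_similarities(user_factors)
    print(top_k_similar(sims, 1))
```

With millions of users the n x n similarity matrix cannot be materialized like this, which is why a top-K formulation that avoids computing all pairs is worth tracking separately.
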
On Wed, Dec 10, 2014 at 2:44 PM, Debasish Das <debasish.das83@gmail.com>
wrote:
> Hi,
>
> It seems there are multiple places where we would like to compute row
> similarities (exact or approximate).
>
> Through RowMatrix columnSimilarities we can already compute the column
> similarities of a tall skinny matrix.
>
> Similarly, we should have an API in RowMatrix called rowSimilarities that
> computes similar rows in a mapreduce fashion. It would be useful for the
> following use cases:
>
> 1. Generate topK users for each user from matrix factorization model
> 2. Generate topK products for each product from matrix factorization model
> 3. Generate kernel matrix for use in spectral clustering
> 4. Generate kernel matrix for use in kernel regression/classification
>
> I am not sure whether there is already a good implementation of mapreduce
> row similarity that we can use. Ideas like fastfood and kitchen sink seem
> aimed more at the classification use case, but user similarities also show
> up in recommendation, which is unsupervised.
>
> Is there a JIRA tracking this? If not, I can open one and we can discuss
> it further there.
>
> Thanks.
> Deb
>
