Row similarity with LLR is much simpler than cosine since you only need non-zero sums for column, row, and matrix elements, so rowSimilarity is implemented in Mahout for Spark. Full-blown row similarity including all the different similarity methods (long since implemented in Hadoop MapReduce) hasn’t been moved to Spark yet.

Yep, rows are not covered in the blog, my mistake. Too bad, since it has a lot of uses and can at the very least be optimized for output matrix symmetry.

On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Yeah okay, thanks.

On Jan 17, 2015, at 11:15 AM, Reza Zadeh <reza@databricks.com> wrote:

Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2.

rowSimilarities in a RowMatrix is a little more tricky because you can't transpose a RowMatrix easily; it is being tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823

Andrew, sometimes (not always) it's OK to transpose a RowMatrix. If, for example, the number of rows in your RowMatrix is less than 1m, you can transpose it and use rowSimilarities.
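The transpose idea can be sketched without Spark at all. The following is a minimal plain-Python stand-in, assuming a small dense matrix held as a list of lists; the helper names (transpose, column_similarities, row_similarities) are hypothetical and are not the MLlib API. It computes cosine similarity between columns, then gets row similarities by transposing first. Only pairs with i < j are emitted, since the result is symmetric.

```python
import math

def transpose(rows):
    """Swap rows and columns of a small dense matrix (list of lists)."""
    return [list(col) for col in zip(*rows)]

def column_similarities(rows):
    """Cosine similarity for every column pair (i, j) with i < j."""
    cols = transpose(rows)
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if norms[i] == 0 or norms[j] == 0:
                continue  # cosine is undefined for an all-zero column
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims

def row_similarities(rows):
    """Row similarities are the column similarities of the transpose."""
    return column_similarities(transpose(rows))

# Hypothetical example data: rows 0 and 1 are parallel, row 2 is orthogonal to both.
matrix = [[1.0, 0.0, 2.0],
          [2.0, 0.0, 4.0],
          [0.0, 3.0, 0.0]]
sims = row_similarities(matrix)
```

This only works when the whole matrix fits comfortably in memory, which is the same caveat as the "less than 1m rows" condition above.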

On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel wrote:
BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the master yet. Does anyone know the status?

Also, the method for computation reduction (making it less than O(n^2)) seems rooted in cosine. A different computation reduction method is used in the Mahout code, tied to LLR. Seems like we should get these together.

On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Excellent, thanks Pat.

On Jan 17, 2015, at 9:27 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

Mahout’s Spark implementation of row similarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [A'A] for columns or [AA'] for rows first, then calculates the distance (LLR) for non-zero elements. This is a major optimization for sparse matrices. As I recall, the old Hadoop code only did this for half the matrix since it’s symmetric, but that optimization isn’t in the current code because the downsampling is done as LLR is calculated, so the entire similarity matrix is never actually calculated unless you disable downsampling.

The primary use is for recommenders but I’ve used it (in the test suite) for row-wise text token similarity too.
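The co-occurrence-then-LLR idea can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not Mahout's code: rows are treated as binary and stored as sets of non-zero column indices, the LLR follows Dunning's entropy-based formulation for a 2x2 contingency table, and all function names are hypothetical. Pairs of rows that share no columns (a zero entry in the row-by-row product A·Aᵀ) are never scored.

```python
import math
from itertools import combinations

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative round-off

def row_llr_similarities(rows, num_cols):
    """LLR score for each row pair (i, j), i < j, that co-occurs at least once.

    Brute-force over pairs for clarity; a sparse matrix product would
    only ever visit the pairs that actually co-occur.
    """
    sims = {}
    for i, j in combinations(range(len(rows)), 2):
        k11 = len(rows[i] & rows[j])      # columns non-zero in both rows
        if k11 == 0:
            continue                      # zero co-occurrence: skip scoring
        k12 = len(rows[i]) - k11          # non-zero only in row i
        k21 = len(rows[j]) - k11          # non-zero only in row j
        k22 = num_cols - k11 - k12 - k21  # zero in both rows
        sims[(i, j)] = llr(k11, k12, k21, k22)
    return sims

# Hypothetical example data: rows 0 and 1 overlap fully, row 2 is disjoint from both.
rows = [{0, 1, 2}, {0, 1, 2}, {3, 4}]
sims = row_llr_similarities(rows, num_cols=10)
```

Only the upper triangle (i < j) is produced here, which is the symmetry saving mentioned for the old Hadoop code; downsampling is left out of the sketch entirely.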

On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Yeah, that's the kind of thing I'm looking for; was looking at SPARK-4259 and poking around to see how to do things.

On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

Andrew, you would be better off using Mahout's RowSimilarityJob for what you are trying to accomplish.

1.  It does give you pair-wise distances
2.  You can specify the distance measure you are looking to use
3.  There's the old MapReduce impl and the Spark DSL impl, per your preference.

From: Andrew Musselman <andrew.musselman@gmail.com>
Cc: user <user@spark.apache.org>
Sent: Saturday, January 17, 2015 11:29 AM
Subject: Re: Row similarities

Thanks Reza, interesting approach. I think what I actually want is to calculate pair-wise distance, on second thought. Is there a pattern for that?

On Jan 16, 2015, at 9:53 PM, Reza Zadeh <reza@databricks.com> wrote:

You can use K-means with a suitably large k. Each cluster should correspond to rows that are similar to one another.

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman wrote:
What's a good way to calculate similarities between all vector-rows in a matrix or RDD[Vector]?

I'm seeing RowMatrix has a columnSimilarities method, but I'm not sure I'm going down a good path by transposing a matrix in order to run that.
