Row similarity with LLR is much simpler than cosine since you only need non-zero sums for column, row, and matrix elements, so rowSimilarity is implemented in Mahout for Spark. Full-blown row similarity, including all the different similarity methods (long since implemented in Hadoop MapReduce), hasn't been moved to Spark yet.
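For contrast with the LLR shortcut, brute-force cosine row similarity needs every row norm and every pairwise dot product. A minimal from-scratch sketch in plain Python (an illustration only, not the Mahout or MLlib implementation):

```python
from math import sqrt

def cosine_row_similarities(rows):
    """Brute-force pairwise cosine similarity between rows.

    Unlike the LLR approach, cosine needs both row norms and all
    dot products, which is why the full O(n^2) pass is heavier.
    """
    norms = [sqrt(sum(x * x for x in r)) for r in rows]
    sims = {}
    n = len(rows)
    for i in range(n):
        for j in range(i + 1, n):  # symmetric, so compute half
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(rows[i], rows[j]))
                sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims

A = [[1.0, 0.0, 2.0],
     [2.0, 0.0, 4.0],   # parallel to row 0, so similarity 1.0
     [0.0, 3.0, 0.0]]   # orthogonal to rows 0 and 1
sims = cosine_row_similarities(A)
```

Note the half-matrix trick from the output symmetry mentioned later in the thread: only (i, j) with i < j is computed.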

Yep, rows are not covered in the blog, my mistake. Too bad, since it has a lot of uses and can at the very least be optimized for output matrix symmetry.

On Jan 17, 2015, at 11:44 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Yeah okay, thanks.

Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2. rowSimilarities in a RowMatrix is a little more tricky because you can't transpose a RowMatrix easily, and is being tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823

Andrew, sometimes (not always) it's OK to transpose a RowMatrix; if for example the number of rows in your RowMatrix is less than 1m, you can transpose it and use rowSimilarities.

On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM. Andrew said rowSimilarity doesn't seem to be in the master yet. Does anyone know the status?

See: https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Also, the method for computation reduction (making it less than O(n^2)) seems rooted in cosine. A different computation reduction method is used in the Mahout code, tied to LLR. Seems like we should get these together.

On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Excellent, thanks Pat.

Mahout's Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [AA'] for columns or [A'A] for rows first, then calculates the distance (LLR) for non-zero elements. This is a major optimization for sparse matrices.

As I recall, the old Hadoop code only did this for half the matrix since it's symmetric, but that optimization isn't in the current code because the downsampling is done as LLR is calculated, so the entire similarity matrix is never actually calculated unless you disable downsampling.

The primary use is for recommenders, but I've used it (in the test suite) for row-wise text token similarity too.

On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Yeah, that's the kind of thing I'm looking for; was looking at SPARK-4259 and poking around to see how to do things.

Andrew, you would be better off using Mahout's RowSimilarityJob for what you are trying to accomplish:

1. It does give you pair-wise distances.
2. You can specify the distance measure you are looking to use.
3. There's the old MapReduce impl and the Spark DSL impl, per your preference.
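The LLR scoring the thread refers to can be sketched numerically. The following is a from-scratch Python sketch of the log-likelihood ratio test for a 2x2 co-occurrence table (the same test Mahout's LogLikelihood class implements, per Dunning; this is an illustration, not the Mahout code itself), where k11 counts co-occurrences, k12 and k21 the remainders of each event, and k22 everything else:

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy term used by the LLR test."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence contingency table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:  # guard against rounding
        return 0.0
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Independent events score ~0; strong co-occurrence scores high.
independent = llr(5, 5, 5, 5)
correlated = llr(10, 1, 1, 10)
```

Because the score needs only the four counts (derivable from non-zero sums for row, column, and matrix), it fits the sparse [A'A] optimization described above.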


From: Andrew Musselman <andrew.musselman@gmail.com>
To: Reza Zadeh <reza@databricks.com>
Cc: user <user@spark.apache.org>
Sent: Saturday, January 17, 2015 11:29 AM
Subject: Re: Row similarities

Thanks Reza, interesting approach. I think what I actually want is to calculate pair-wise distance, on second thought. Is there a pattern for that?

You can use K-means with a suitably large k. Each cluster should correspond to rows that are similar to one another.

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

What's a good way to calculate similarities between all vector-rows in a matrix or RDD[Vector]?

I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm going down a good path to transpose a matrix in order to run that.
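The K-means idea suggested here can be sketched outside Spark. Below is a tiny Lloyd's-algorithm illustration in plain Python (on a real RDD[Vector] you would use MLlib's KMeans instead); with a suitably large k, rows that land in the same cluster are the "similar" ones:

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(rows, k, iters=20, seed=0):
    """Tiny Lloyd's k-means: similar rows end up in the same cluster."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist(r, centroids[c]))
                  for r in rows]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

rows = [[0.0, 0.0], [0.1, 0.0],   # one tight group
        [5.0, 5.0], [5.1, 4.9]]   # another tight group
labels = kmeans(rows, k=2)
```

This gives cluster co-membership rather than the pair-wise distances Andrew asks about next, which is exactly the distinction drawn in the reply above.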