spark-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Row similarities
Date Sat, 17 Jan 2015 18:45:40 GMT
BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM.
Andrew said rowSimilarity doesn’t seem to be in the master yet. Does anyone know the status?

See: https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Also, the method for reducing the computation (to less than O(n^2)) seems rooted in cosine.
A different reduction method, tied to LLR, is used in the Mahout code. Seems like
we should get these together.
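For reference, the cosine result DIMSUM approximates can be checked against a brute-force version. A minimal sketch in plain Python (illustrative only, not the MLlib code) computes exact cosine column similarities, with a comment on where DIMSUM's sampling cuts the O(n^2) cost:

```python
import math

def column_cosine_similarities(rows):
    """Exact cosine similarity for every column pair of a row-major
    matrix (list of equal-length rows). Brute force, no sampling."""
    ncols = len(rows[0])
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(ncols)]
    sims = {}
    for i in range(ncols):
        for j in range(i + 1, ncols):
            dot = sum(r[i] * r[j] for r in rows)
            if norms[i] and norms[j]:
                sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims

# DIMSUM's reduction: rather than summing every product, each row's
# contribution is emitted with a probability on the order of
# sqrt(gamma) / (norm_i * norm_j), so heavy columns are downsampled
# and the expected result still approximates the exact values above.

A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
print(column_cosine_similarities(A))  # (0,1) and (1,2) ~ 0.707, (0,2) = 0.0
```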
 
On Jan 17, 2015, at 9:37 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Excellent, thanks Pat.

On Jan 17, 2015, at 9:27 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Mahout’s Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class.
It actually does either row or column similarity but only supports LLR at present. It does
[AA’] for columns or [A’A] for rows first, then calculates the distance (LLR) for non-zero
elements. This is a major optimization for sparse matrices. As I recall, the old Hadoop code
only did this for half the matrix since it’s symmetric. That optimization isn’t in
the current code because the downsampling is done as LLR is calculated, so the entire similarity
matrix is never actually computed unless you disable downsampling.
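The LLR score Pat describes can be sketched independently of Mahout. This is Dunning's log-likelihood ratio on a 2x2 co-occurrence table; assuming (not quoting) the SimilarityAnalysis source, something of this shape is computed for each non-zero element:

```python
import math

def _x_log_x(x):
    # x * ln(x), with the 0 * ln(0) = 0 convention
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    total = sum(counts)
    return _x_log_x(total) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = both events together, k12/k21 = one event only, k22 = neither."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(10, 0, 0, 10))  # strong co-occurrence -> large score
print(llr(1, 1, 1, 1))    # independence -> 0.0
```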
> 
> The primary use is for recommenders but I’ve used it (in the test suite) for row-wise
text token similarity too.  
> 
> On Jan 17, 2015, at 9:00 AM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> 
> Yeah that's the kind of thing I'm looking for; was looking at SPARK-4259 and poking around
to see how to do things.
> 
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259
> 
> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
> 
>> Andrew, you would be better off using Mahout's RowSimilarityJob for what you're trying
to accomplish.
>> 
>>  1.  It does give you pair-wise distances
>>  2.  You can specify the distance measure you're looking to use
>>  3.  There's the old MapReduce impl and the Spark DSL impl, per your preference.
>> 
>> From: Andrew Musselman <andrew.musselman@gmail.com>
>> To: Reza Zadeh <reza@databricks.com>
>> Cc: user <user@spark.apache.org>
>> Sent: Saturday, January 17, 2015 11:29 AM
>> Subject: Re: Row similarities
>> 
>> Thanks Reza, interesting approach.  I think what I actually want is to calculate
pair-wise distance, on second thought.  Is there a pattern for that?
>> 
>> 
>> 
>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <reza@databricks.com> wrote:
>> 
>>> You can use K-means (https://spark.apache.org/docs/latest/mllib-clustering.html)
with a suitably large k. Each cluster should correspond to rows that are similar to one another.
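Reza's suggestion can be illustrated without Spark: a toy Lloyd's k-means over the rows, with similar rows landing in the same cluster. This is a minimal stand-in for MLlib's KMeans (deterministic init from the first k points, assumed here purely for illustration):

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means over a list of equal-length vectors.
    Returns a cluster index per point. Toy sketch, not MLlib."""
    centers = [list(p) for p in points[:k]]  # deterministic init
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

rows = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
print(kmeans(rows, 2))  # rows 0,1 share a cluster; rows 2,3 share the other
```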
>>> 
>>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>>> What's a good way to calculate similarities between all vector-rows in a matrix
or RDD[Vector]?
>>> 
>>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm going
down a good path to transpose a matrix in order to run that.
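The transpose idea can be checked outside Spark: the row similarities of A are exactly the column similarities of Aᵀ. A minimal plain-Python sketch (illustrative; not the RowMatrix API):

```python
import math

def transpose(rows):
    """Flip a row-major matrix so former rows become columns."""
    return [list(col) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def column_pairs_cosine(rows):
    """Cosine similarity for every column pair of a row-major matrix."""
    cols = transpose(rows)
    return {(i, j): cosine(cols[i], cols[j])
            for i in range(len(cols)) for j in range(i + 1, len(cols))}

A = [[1.0, 0.0, 1.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

# Column similarities of the transpose == row similarities of A
sims = column_pairs_cosine(transpose(A))
print(sims[(0, 1)])  # rows 0 and 1 of A are identical -> 1.0
print(sims[(0, 2)])  # rows 0 and 2 share no non-zeros -> 0.0
```

For a large distributed matrix the caution is warranted, though: materializing the transpose can be expensive, which is part of why a native rowSimilarity was being discussed.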
>>> 
>> 
>> 
> 

