mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: RowSimilarityJob
Date Thu, 31 May 2012 03:22:10 GMT
Pat,

Here is an example from the output of the rowsimilarity job for a corpus I am working with
(using Cosine Similarity).

Key: 25: Value: {27433:0.9999999999999994}


What this means is that Document# 25 is similar to Document# 27433 with a cosine similarity of 0.9999999999999994, i.e. very nearly 1.
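
If it helps to read the dump programmatically, here is a minimal sketch in Python. It assumes every line has exactly the text form shown above ("Key: <row>: Value: {<col>:<similarity>,...}"); the function name parse_similarity_line is made up for illustration and is not part of Mahout.

```python
import re

def parse_similarity_line(line):
    """Parse 'Key: <row>: Value: {<col>:<sim>,...}' into (row, {col: sim}).

    Assumption: the line follows the text format of the example above.
    """
    match = re.match(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}\s*$", line.strip())
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    row = int(match.group(1))
    similarities = {}
    body = match.group(2).strip()
    if body:
        # Each entry is '<other row id>:<similarity value>'.
        for pair in body.split(","):
            col, sim = pair.split(":")
            similarities[int(col)] = float(sim)
    return row, similarities

row, sims = parse_similarity_line("Key: 25: Value: {27433:0.9999999999999994}")
```

Here row is the document the line belongs to, and sims maps each similar document to its similarity value.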

Since Distance = (1 - Similarity), this means that the distance between documents 25 and 27433
above is approximately 0 (= 1 - 0.9999999999999994), or in other words they are very similar.
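
To make the relationship concrete, here is a small Python sketch of plain cosine similarity and the distance derived from it. This is the textbook formula, not Mahout's code, and the vectors are made-up toy data rather than anything from my corpus.

```python
import math

def cosine_similarity(a, b):
    """Textbook cosine: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance = 1 - Similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical term-weight vectors for three documents.
doc_a = [1.0, 2.0, 0.0, 3.0]
doc_b = [2.0, 4.0, 0.0, 6.0]   # same direction as doc_a, so similarity is ~1
doc_c = [0.0, 0.0, 5.0, 0.0]   # shares no terms with doc_a, so similarity is 0

print(cosine_similarity(doc_a, doc_b), cosine_distance(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c), cosine_distance(doc_a, doc_c))
```

Floating-point rounding is why two documents pointing in the same direction can come out as 0.9999999999999994 rather than exactly 1.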

Hope that clarifies.

Suneel



________________________________
 From: Pat Ferrel <pat@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Wednesday, May 30, 2012 10:22 PM
Subject: RowSimilarityJob
 
What is the value created to describe similarity by RowSimilarityJob? The paper which describes
how the algorithm is implemented doesn't describe the various similarity values returned by
Mahout. It seems to focus on cooccurrences.

For SIMILARITY_COSINE is the value = cosine or 1 - cosine?

Is the value calculated independently, after cooccurrences determine which docs are similar?

The code is very difficult to read so a little help would be appreciated.