mahout-user mailing list archives

From Chris Waggoner <>
Subject best similarity metric for collaborative filtering
Date Tue, 26 Apr 2011 02:21:25 GMT
Sean Owen suggested that we move our discussion to this mailing list.

Here's the background.

I'm trying to decide on the best similarity metric for a product 
recommendation system using item-based collaborative filtering. This is a 
shopping-basket scenario where ratings are binary-valued: the user has 
either purchased an item or not, and there is no explicit rating system.

Step 1 is to compute item-to-item similarity, though I want to look at 
incorporating more features later on.

Is the Tanimoto coefficient the best way to go for binary values? Or are 
there other metrics that are appropriate here? Thanks

Chiming in on this old thread: if you have "binary" ratings (either the 
association exists or doesn't, but has no magnitude), then indeed you are 
forced to look at metrics like the Tanimoto / Jaccard coefficient.
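
On binary purchase data the Tanimoto (Jaccard) coefficient reduces to set overlap. A minimal sketch, representing each item as the set of IDs of users who bought it (a representation chosen here for illustration, not Mahout's internal one):

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto / Jaccard coefficient: |A intersect B| / |A union B|.

    buyers_a, buyers_b: sets of IDs of users who bought each item.
    """
    union = len(buyers_a | buyers_b)
    if union == 0:
        return 0.0  # neither item was ever bought
    return len(buyers_a & buyers_b) / union

print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 total -> 0.5
```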

However, I'd suggest a log-likelihood similarity metric is significantly 
better for situations like this; Mahout ships this as LogLikelihoodSimilarity.
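
As a rough, self-contained sketch (not Mahout's actual code): the metric is built on Ted Dunning's log-likelihood ratio (the G² statistic), computed from a 2×2 contingency table of purchase counts.

```python
import math

def _xlogx(x):
    # x * ln(x), with the 0 * ln(0) == 0 convention
    return 0.0 if x == 0 else x * math.log(x)

def _entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    total = sum(counts)
    return _xlogx(total) - sum(_xlogx(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: users who bought both items
    k12: users who bought item A but not item B
    k21: users who bought item B but not item A
    k22: users who bought neither
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # clamp tiny negatives caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))

def llr_similarity(k11, k12, k21, k22):
    # squash the unbounded ratio into [0, 1); Mahout applies a
    # similar 1 - 1/(1 + llr) mapping
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))
```

Items whose co-occurrence matches what independence predicts (e.g. counts 5, 5, 5, 5) score near zero, while items bought strictly together (10, 0, 0, 10) score high, which is why the metric works well even without rating magnitudes.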

@Sean Owen Why do you think log likelihood is better than cosine here? – 
Lao Tzu Mar 14 at 5:59

Cosine won't work in this situation -- there are no ratings. It will be 
undefined for all pairs. – Sean Owen Mar 14 at 9:59
@Sean Owen Take vector A to be a {0,1} list of which users have bought 
item A. Take vector B to be a {0,1} list of which users have bought item 
B. (A^t times B) over (|A| times |B|) computes the cosine similarity 
between the two items. – Lao Tzu Apr 15 at 9:07

Binary ratings are not ratings over {0,1}. That leaves three possible 
states: 0, 1, and non-existent. Ratings are always 1 where they exist. 
Both vectors end up being (0,0) in the cosine measure, and so the cosine 
measure is 0/0 and is undefined. This, at least, is how it would work in 
the formulation Mahout uses. – Sean Owen Apr 15 at 11:28
@Sean Owen As I understand @allclaws it is binary. Bought or not bought. 
"Non-existent" is the same as not-bought. I'm not familiar with Mahout and 
I don't know what you mean by "Both vectors end up being (0,0) in the 
cosine measure". dim(A)=dim(B) = number of customers. And dim( 
similarity(A,B) ) = 1. So what does the (0,0) pair refer to? – Lao Tzu Apr 
15 at 14:02

The data are "centered" in Mahout to make the computation equivalent to 
the Pearson correlation. So, the vectors would start out being like 
(1,1,1,1) and end up being like (0,0,0,0). The cosine measure between two 
degenerate vectors is undefined. Even if you don't center -- the measure 
just gives you "1" in all cases since the angle is 0. – Sean Owen Apr 15 
at 17:38
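
To make the centering point concrete, here is a toy illustration (not Mahout code): with binary data, every existing rating is 1, so mean-centering collapses both vectors to zero and the cosine becomes 0/0.

```python
import math

# Two items, each bought by the same four users. With "binary" data
# every existing rating is 1.
a = [1.0, 1.0, 1.0, 1.0]
b = [1.0, 1.0, 1.0, 1.0]

def center(v):
    """Subtract the mean, as a Pearson-style cosine computation does."""
    m = sum(v) / len(v)
    return [x - m for x in v]

ca, cb = center(a), center(b)  # both collapse to [0.0, 0.0, 0.0, 0.0]
dot = sum(x * y for x, y in zip(ca, cb))
norm = math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(x * x for x in cb))

# cosine would be dot / norm == 0 / 0: undefined
print(dot, norm)  # 0.0 0.0
```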
@Sean Owen Sorry if I'm being thick. I still don't get where the 
constant-value vectors are coming from. Let's say we're talking about item 
"Lord of the Rings DVD". User 1 bought it, users 2 thru 79 didn't, user 80 
bought it, ... and so on. So vec = [1, 0, 0, 0, ..., 0, 0, 1, 0, 0, ... ]. 
There's more I don't get in what you said above but let's start there. – 
Lao Tzu Apr 16 at 4:38

We can discuss this better elsewhere, perhaps. In my mind at 
least, there are never 0 values in the vectors. They are 1, or 
non-existent. Otherwise you have three states: 1, 0, or non-existent. – 
Sean Owen Apr 16 at 9:45


I've never used Mahout but what this @allclaws wants sounds like a simple 
proposition. Given a vector like

didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy
didn't buy

define "bought" == 1 and "didn't buy" == 0. Define the similarity between 
two such vectors to be { A dot B } over { |A| times |B| }. Not that I find 
this compelling as a definition of similarity, but @allclaws called this a 
first, rough pass.