mahout-user mailing list archives

From "Richard Tomsett" <indigentmart...@gmail.com>
Subject Re: Text clustering
Date Sat, 06 Dec 2008 22:20:39 GMT
Ah, I didn't realise that there was an implementation of the Pearson
correlation; I just wrote a cosine distance measure myself. The cosine
measure does range from -1 to 1, but TF-IDF vectors have no negative
components, so in practice it ranges from 0 to 1. You have to be careful,
though, because the k-means implementation assumes that a larger distance
value means "further away" (for clustering purposes), whereas with the
cosine a larger value means "closer together".
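
To make the point about direction concrete, here is a minimal sketch in Python (not Mahout code). One common way around the problem is to hand k-means 1 - cosine instead of the cosine itself, so that larger really does mean further away. The function names, the zero-vector convention and the toy TF-IDF vectors are all assumptions made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine, so that a larger value means 'further away'."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical TF-IDF vectors over a three-term vocabulary.
doc_a = [0.5, 0.0, 1.2]
doc_b = [1.0, 0.0, 2.4]  # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]  # shares no terms with doc_a

near = cosine_distance(doc_a, doc_b)  # close to 0
far = cosine_distance(doc_a, doc_c)   # close to 1
```

With non-negative TF-IDF vectors this distance stays between 0 and 1, which matches the effective range described above.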


2008/12/6 Sean Owen <srowen@gmail.com>

> To answer a few recent points:
>
> Not sure if this is helpful, but, the collaborative filtering part of
> Mahout contains an implementation of cosine distance measure -- sort
> of. Really it has an implementation of the Pearson correlation, which
> is equivalent, if the data are 'centered' (have a mean of 0). This is,
> in my opinion, a good idea. So if you agree, you could copy and adapt
> this implementation of Pearson to your purpose. It is pretty easy to
> re-create the actual cosine distance measure from this code too -- I
> used to have it separately in the code.
>
> The Tanimoto distance is a ratio of intersection to union of two sets,
> so is between 0 and 1. Cosine distance is, essentially, the cosine of
> an angle in feature-space, so is between -1 and 1.
>
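
As a rough check of the two claims quoted above, the following Python sketch (again not Mahout's implementation) compares the Pearson correlation with the cosine of mean-centered vectors, and computes the Tanimoto coefficient as intersection over union of two sets. All names, the empty-set convention and the sample data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def center(v):
    """Subtract the mean so the vector has a mean of 0."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson(a, b):
    """Pearson correlation computed directly from deviations."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def tanimoto(s, t):
    """Ratio of intersection size to union size of two sets."""
    union = s | t
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(s & t) / len(union)

# Hypothetical preference vectors and term sets.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]

r = pearson(a, b)
c = cosine(center(a), center(b))  # agrees with r up to rounding
t = tanimoto({"apple", "pear", "plum"}, {"pear", "plum", "fig"})
```

Note that once the data are centered the cosine can go negative (for example, `[1, 2, 3]` against `[3, 2, 1]`), which is where the -1 to 1 range comes from, whereas the Tanimoto value always stays between 0 and 1.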
