mahout-user mailing list archives

From "Sean Owen" <>
Subject Re: Text clustering
Date Sat, 06 Dec 2008 13:04:10 GMT
To answer a few recent points:

Not sure if this is helpful, but the collaborative filtering part of
Mahout contains an implementation of the cosine distance measure -- sort
of. Really it has an implementation of the Pearson correlation, which
is equivalent to the cosine measure if the data are 'centered' (have a
mean of 0). Centering is, in my opinion, a good idea. So if you agree,
you could copy and adapt this implementation of Pearson to your purpose.
It is also pretty easy to re-create the actual cosine distance measure
from this code -- I used to have it separately in the code.
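
To illustrate the equivalence described above, here is a minimal sketch in
plain Python. It is not Mahout's code; the function names (cosine_similarity,
center, pearson_correlation) are made up for this example, and it assumes
equal-length vectors with non-zero variance.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # fails for an all-zero vector

def center(v):
    """Subtract the mean so the vector has a mean of 0."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson_correlation(a, b):
    """Textbook Pearson correlation: covariance over product of std devs."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # fails for a constant vector

a = [1.0, 2.0, 3.0, 5.0]
b = [2.0, 1.0, 4.0, 6.0]

# Pearson on the raw data matches cosine on the centered data.
print(pearson_correlation(a, b))
print(cosine_similarity(center(a), center(b)))
# Cosine on the raw (uncentered) data is, in general, a different number.
print(cosine_similarity(a, b))
```

The first two printed values agree up to floating-point rounding, while the
third differs because the raw vectors do not have a mean of 0.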

The Tanimoto measure is the ratio of the size of the intersection of two
sets to the size of their union, so it is between 0 and 1. The cosine
measure is, essentially, the cosine of an angle in feature-space, so it is
between -1 and 1.
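
As a rough illustration of those two ranges, here is a small sketch in plain
Python. It is not taken from Mahout; the function names are hypothetical, and
returning 0.0 for two empty sets is an assumed convention.

```python
import math

def tanimoto_coefficient(s, t):
    """Size of the intersection divided by size of the union of two sets."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(s & t) / len(union)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tanimoto stays within [0, 1]: disjoint sets give 0, identical sets give 1.
print(tanimoto_coefficient({"a", "b"}, {"c", "d"}))
print(tanimoto_coefficient({"a", "b"}, {"b", "c"}))
print(tanimoto_coefficient({"a", "b"}, {"a", "b"}))

# Cosine can reach -1 when the vectors point in opposite directions.
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
# With non-negative entries (e.g. term counts) it cannot go below 0.
print(cosine_similarity([1.0, 0.0, 2.0], [0.0, 3.0, 1.0]))
```

For text clustering on non-negative term counts the dot product is never
negative, so in practice the cosine measure there falls between 0 and 1.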

On Sat, Dec 6, 2008 at 12:54 PM, Philippe Lamarche
<> wrote:
> Hi,
> I used the Tanimoto distance. As I understand it, it's almost like the
> cosine distance, with a range between 0 and infinity as opposed to 0 and
> 3.14. Seems to work well.
> On Fri, Dec 5, 2008 at 11:54 PM, dipesh <> wrote:
>> Hi Philippe,
>> I'm also doing some work on text clustering with feature extraction. For
>> text clustering, the cosine distance is considered a better similarity
>> metric than the Euclidean distance measure. I couldn't find
>> CosineDistanceMeasure in Mahout; did you use a cosine distance measure
>> in your clustering project?
