mahout-user mailing list archives

From Yuval Feinstein <yuv...@citypath.com>
Subject Re: Kmeans clustering with Tanimoto distance measure in Mahout
Date Thu, 21 Jun 2012 10:03:57 GMT
Hi Shlomy.
According to the documentation:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/distance/TanimotoDistanceMeasure.html
The code uses the Tanimoto formula based on the inner product and norms.
Therefore, you get a distance value for every pair of vectors, even if the
cluster centroids have coordinates other than 0 and 1.
Of course, you can go and read the source code, which I have not done, and
further check this.
Or just run an experiment.
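
A minimal sketch of such an experiment, in plain Python rather than Mahout
itself. It assumes the standard extended-Jaccard form of the Tanimoto
distance, 1 - a.b / (|a|^2 + |b|^2 - a.b), which has not been checked against
the Mahout source. The vectors and the function names are made up for
illustration.

```python
def tanimoto_distance(a, b):
    """Extended-Jaccard (Tanimoto) distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a_sq = sum(x * x for x in a)
    norm_b_sq = sum(y * y for y in b)
    denom = norm_a_sq + norm_b_sq - dot
    if denom == 0:
        # Both vectors are all zeros; treating them as identical is an
        # assumption of this sketch, not documented Mahout behaviour.
        return 0.0
    return 1.0 - dot / denom


def jaccard_distance(a, b):
    """Set-based Jaccard distance over the indexes of non-zero features."""
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, y in enumerate(b) if y != 0}
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical users: 1 means the user read that document.
user_a = [1, 1, 0, 1, 0]
user_b = [1, 0, 0, 1, 1]

# Hypothetical centroid with fractional coordinates, as k-means produces
# when it averages binary vectors.
centroid = [0.5, 0.5, 0.0, 1.0, 0.0]

# On 0/1 input the two measures agree.
print(tanimoto_distance(user_a, user_b), jaccard_distance(user_a, user_b))

# On a fractional centroid the formula is still defined, but it no longer
# reads as an intersection/union ratio of feature sets.
print(tanimoto_distance(user_a, centroid))
```

On binary input the dot product counts the shared features and each squared
norm counts a vector's own features, so the formula reduces to the Jaccard
ratio. With fractional centroids it still returns a value between 0 and 1 for
non-negative vectors, only without the set interpretation.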
Cheers,
Yuval


On Thu, Jun 14, 2012 at 11:06 PM, Shlomy Boshy <shlomyb@outbrain.com> wrote:

> Hi all,
>
> I'm doing KMeans clustering in Mahout using the Tanimoto distance measure.
>
> My input are feature vectors for which the indexes are the features and the
> value is 1 for features that exist in the sample, and 0 for non-existing
> features
> (it is actually clustering of users by documents they read, so for each
> user we have 1 in the documents that he read)
>
> So the input vectors are only 0 or 1
>
> But the output clusters have double values, not only 0 and 1,
> and in the KMeans iterations I guess KMeans moves the cluster centers to
> various values for all features, not only 0 and 1
>
> So will the Tanimoto distance measure work in this case?
> I think it only gives the Jaccard Index when the values are 0 and 1
> (else it will not reflect the ratio between intersection and union of the
> features in the 2 points)
>
> If I add feature weights, the values given to the distance measure will be
> even further from only 0 or 1
>
> So will TanimotoDistanceMeasure really work in KMeans clustering in Hadoop?
>
> See this link for when Tanimoto is really a proper distance measure:
> http://en.wikipedia.org/wiki/Jaccard_index
>
