mahout-user mailing list archives

From Yuval Feinstein <>
Subject Re: Kmeans clustering with Tanimoto distance measure in Mahout
Date Thu, 21 Jun 2012 10:03:57 GMT
Hi Shlomy.
According to the documentation:
The code uses the Tanimoto formula based on the inner product and norms.
Therefore, you get a distance value for every pair of vectors, even if the
cluster centroids have coordinates other than 0 and 1.
Of course, you can go and read the source code, which I have not done, and
further check this.
Or just run an experiment.
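
Such an experiment could look like the minimal Python sketch below. It is not Mahout's source code; it only assumes the formula described above, distance = 1 - a.b / (|a|^2 + |b|^2 - a.b), and the names tanimoto_distance and centroid are hypothetical helpers made up for this illustration.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance built from the inner product and squared norms.

    Assumed formula (not copied from Mahout):
        1 - a.b / (|a|^2 + |b|^2 - a.b)
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Both vectors are all zeros; treating them as identical is an
        # assumption of this sketch.
        return 0.0
    return 1.0 - dot / denom


def centroid(vectors):
    """Coordinate-wise mean, as a k-means update step would compute it."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Two users described by binary "has read document i" features.
u1 = [1, 1, 0, 0]
u2 = [1, 0, 1, 0]

# The centroid of binary vectors is generally not binary.
c = centroid([u1, u2])

# Binary vs. binary: equals 1 - |intersection| / |union| (Jaccard distance).
d_binary = tanimoto_distance(u1, u2)

# Binary vs. fractional centroid: still a well-defined number.
d_centroid = tanimoto_distance(u1, c)

print(c, d_binary, d_centroid)
```

For 0/1 inputs the value matches the Jaccard distance. For a fractional centroid the formula still returns a number, although it no longer reads as a ratio of intersection to union.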

On Thu, Jun 14, 2012 at 11:06 PM, Shlomy Boshy <> wrote:

> Hi all,
> I'm doing Kmeans clustering in Mahout using the Tanimoto distance measure.
> My inputs are feature vectors in which the indexes are the features and the
> value is 1 for features that exist in the sample, and 0 for non-existing
> features
> (it is actually clustering of users by documents they read, so for each
> user we have 1 in the documents that he read)
> So the input vectors are only 0 or 1
> But the output clusters are double values - not only 0 and 1
> and in the kmeans iterations I guess Kmeans moves the cluster centers to
> various values for all features - not only 0 and 1
> So will the Tanimoto distance measure work in this case?
> I think it only gives the Jaccard Index when the values are 0 and 1
> (else it will not reflect the ratio between intersection and union of the
> features in the 2 points)
> If I add feature weights, then even more so the values given to the
> distance measure will not be only 0 or 1
> So will TanimotoDistanceMeasure really work in KMeans clustering in Hadoop?
> See this link for when Tanimoto is really a proper distance measure:
