mahout-user mailing list archives

From Shlomy Boshy <shlo...@outbrain.com>
Subject Kmeans clustering with Tanimoto distance measure in Mahout
Date Thu, 14 Jun 2012 20:06:31 GMT
Hi all,

I'm doing K-means clustering in Mahout using the Tanimoto distance measure.

My input is a set of feature vectors in which each index is a feature and
the value is 1 for features that exist in the sample and 0 for features
that do not.
(It is actually clustering of users by the documents they read, so each
user has a 1 for every document they have read.)

So the input vectors contain only 0s and 1s.

But the output clusters contain double values, not only 0 and 1,
and I guess that during the iterations K-means moves the cluster centers
to arbitrary values for all features, not only 0 and 1.
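To illustrate what I mean about the centers, here is a small sketch in
plain Python (not Mahout code; the three user vectors are made up, and
the center update is the usual component-wise mean, which I assume is what
K-means does here):

```python
# Hypothetical data: three users over four documents (1 = read, 0 = not read).
users = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]


def centroid(vectors):
    """Component-wise mean of the vectors, as in a standard k-means update."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


center = centroid(users)

# Features read by only some of the users end up strictly between 0 and 1.
fractional = [value for value in center if value not in (0.0, 1.0)]
print(center)
print(fractional)
```

Only the first document, which every user read, keeps a value of exactly 1;
the other components of the center are fractions.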

So will the Tanimoto distance measure work in this case?
I think it only gives the Jaccard index when the values are 0 and 1
(otherwise it will not reflect the ratio between the intersection and the
union of the features of the two points).
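A small sketch of the concern, again in plain Python rather than Mahout
code. It assumes the real-valued Tanimoto form
1 - (a.b) / (|a|^2 + |b|^2 - a.b); I believe this is what
TanimotoDistanceMeasure computes, but I have not checked it against the
source. The vectors and the center are made-up values:

```python
def tanimoto_distance(a, b):
    """Real-valued Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        return 0.0  # both vectors are all zeros; treated as identical (assumption)
    return 1.0 - dot / denom


def jaccard_distance(a, b):
    """Set-based Jaccard distance over the indexes of the non-zero features."""
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, y in enumerate(b) if y != 0}
    union = set_a | set_b
    if not union:
        return 0.0  # no features at all; treated as identical (assumption)
    return 1.0 - len(set_a & set_b) / len(union)


# Two binary user vectors and one hypothetical non-binary cluster center.
u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
center = [1.0, 0.5, 0.5, 0.0]

binary_tanimoto = tanimoto_distance(u, v)
binary_jaccard = jaccard_distance(u, v)
center_tanimoto = tanimoto_distance(u, center)
center_jaccard = jaccard_distance(u, center)

print(binary_tanimoto, binary_jaccard)
print(center_tanimoto, center_jaccard)
```

For the two binary vectors both functions agree, but for the distance from
a binary vector to the fractional center they give different values, so
under this formula the measure is no longer the set-based Jaccard distance
once the centers move away from 0 and 1.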

If I add feature weights, even more of the values given to the distance
measure will be something other than 0 or 1.

So will TanimotoDistanceMeasure really work in KMeans clustering in Hadoop?

See this link for when Tanimoto is really a proper distance measure:
http://en.wikipedia.org/wiki/Jaccard_index
