Interesting - to which thresholds do you refer? On Thu, Jun 21, 2012 at 2:24 PM, Sean Owen wrote: > Granted I too may be missing something since I am not familiar with > the code so much, but... > > The Tanimoto "distance" isn't a proper distance metric is it? not when > defined over real-valued vectors like it is here. That seems like the > root issue here. I'm pretty sure we need a distance metric in > clustering right? > > It's not that the Tanimoto distance isn't defined on these vectors -- > it is, or can be, and is in Mahout. It's also not a problem to take > the centroid as a mean of vectors; that's valid. > > The problem you describe seems like a matter of not setting the > clustering thresholds quite right. > > > On Thu, Jun 21, 2012 at 11:39 AM, Shlomy Boshy > wrote: > > Yuval - Thx for your reply > > > > It works technically but I believe it doesnt give good logical results - > I > > think it has a logical issue when it comes to this distance measure > > > > The problem is that the current kmeans optimization in Mahout > > does connect points to clusters by the distance measure - so I can use a > > distance measure like Tanimoto which handles well binary measures > (Jaccard > > Index etc.) > > BUT for making new cluster centers > > it *always *uses an arithmetic average in each feature > > (i.e. sum of values in the measure X in all points belonging to the > > cluster, divided by the number of points) > > for any distance measure > > [I read the code - I hope my understanding was right] > > > > This is good for Euclidean distance > > > > but for binary measures / Jaccard index distance function > > I believe this doesnt create a cluster center well. > > > > A cluster center should be a point "close" to many of the points in the > > cluster by the distance measure > > > > Also by the logical target of the Tanimoto measure it needs to "close" in > > the sense of high intersection with the points, and low union (Jaccard > > Index). > > > > If I use Jaccard index / Tanimoto as the distance measure > > It think I want a cluster center which reflects the following property: > > the features in the cluster center which are non-zero should be the > > features appearing in most of the points in the cluster > > (i.e. if we order the features by their occurences in points in the > cluster > > we want the top ones in the cluser center) > > > > This way the cluster center will become "close" to most points in the > > cluster in the Tanimoto distance measure: > > it will have a large intersection with many of the points, and not so > large > > union with them > > (The issue is how much of these features to take into the cluster center > - > > I figured the avg number of features in a point * some ratio) > > > > (I can state it more formally - this is just a short description of the > > idea...) > > > > When using the current kmeans implementation in Kmeans > > I got cluster centers with many-many features with very low weights > (0.001 > > etc.) - which didnt get me to clustering of binary measures as I wanted > > points were not divided well between clusters (all into one huge cluster) > > > > Maybe there is a mathematical meaning - but it doesnt logically solve the > > problem I'm trying to solve > > (cluster users to clusters by the documents they read - that is why it is > > binary) > > > > So I implemented myself kmeans for binary measures with the solution > > describes above > > and it worked much better - I got small cluster centers - binary (0/1 in > > the value) > > points where distributed well between clusters > > > > > > What do you think of this problem/solution? > > > > If more think this is an interesting direction I might want to contribute > > it into the Mahout library. > > Would it interest anyone? > > > > I think it focuses on a different way to create a cluster center for > points > > when your distance measure deals with binary measures and is based on > > Jaccard Index / Tanimoto distance measure > > > > > > > > > > On Thu, Jun 21, 2012 at 1:03 PM, Yuval Feinstein >wrote: > > > >> Hi Shlomy. > >> According to the documentation: > >> > >> > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/distance/TanimotoDistanceMeasure.html > >> The code uses the Tanimoto formula based on the inner multiple and > norms. > >> Therefore, you get a distance value for every pair of vectors, even if > the > >> cluster centroids have coordinates different than 0 and 1. > >> Of course, you can go and read the source code, which I have not done, > and > >> further check this. > >> Or just run an experiment. > >> Cheers, > >> Yuval > >> > >> > >> On Thu, Jun 14, 2012 at 11:06 PM, Shlomy Boshy > >> wrote: > >> > >> > Hi all, > >> > > >> > Im doing Kmeans clustering in Mahout using Tanimoto distance measure > >> > > >> > My input are feature vectors for which the indexes are the features > and > >> the > >> > value is 1 for features that exist in the sample, and 0 for > non-existing > >> > features > >> > (it is actually clustering of users by documents they read, so for > each > >> > user we have 1 in the documents that he read) > >> > > >> > So the input vectors are only 0 or 1 > >> > > >> > By the output clusters are double values - not only 0 and 1 > >> > and in the kmeans iterations I guess Kmeans move the cluster centers > to > >> > various values for all features - not only 0 and 1 > >> > > >> > So will the Tanimoto distance measure work in this case? > >> > I think it only gives the Jaccard Index when the values are 0 and 1 > >> > (else it will not reflect the ratio between intersection and union of > the > >> > features in the 2 points) > >> > > >> > If I add feature weights even more it will not be only 0 or 1 values > >> given > >> > to the distance measure > >> > > >> > So will TanimotoDistanceMeasure really work in KMeans clustering in > >> Hadoop? > >> > > >> > See this link for when Tanimoto is really a proper distance measure: > >> > http://en.wikipedia.org/wiki/Jaccard_index > >> > > >> >