Granted I too may be missing something since I am not familiar with
the code so much, but...
The Tanimoto "distance" isn't a proper distance metric is it? not when
defined over realvalued vectors like it is here. That seems like the
root issue here. I'm pretty sure we need a distance metric in
clustering right?
It's not that the Tanimoto distance isn't defined on these vectors 
it is, or can be, and is in Mahout. It's also not a problem to take
the centroid as a mean of vectors; that's valid.
The problem you describe seems like a matter of not setting the
clustering thresholds quite right.
On Thu, Jun 21, 2012 at 11:39 AM, Shlomy Boshy <shlomyb@outbrain.com> wrote:
> Yuval  Thx for your reply
>
> It works technically but I believe it doesnt give good logical results  I
> think it has a logical issue when it comes to this distance measure
>
> The problem is that the current kmeans optimization in Mahout
> does connect points to clusters by the distance measure  so I can use a
> distance measure like Tanimoto which handles well binary measures (Jaccard
> Index etc.)
> BUT for making new cluster centers
> it *always *uses an arithmetic average in each feature
> (i.e. sum of values in the measure X in all points belonging to the
> cluster, divided by the number of points)
> for any distance measure
> [I read the code  I hope my understanding was right]
>
> This is good for Euclidean distance
>
> but for binary measures / Jaccard index distance function
> I believe this doesnt create a cluster center well.
>
> A cluster center should be a point "close" to many of the points in the
> cluster by the distance measure
>
> Also by the logical target of the Tanimoto measure it needs to "close" in
> the sense of high intersection with the points, and low union (Jaccard
> Index).
>
> If I use Jaccard index / Tanimoto as the distance measure
> It think I want a cluster center which reflects the following property:
> the features in the cluster center which are nonzero should be the
> features appearing in most of the points in the cluster
> (i.e. if we order the features by their occurences in points in the cluster
> we want the top ones in the cluser center)
>
> This way the cluster center will become "close" to most points in the
> cluster in the Tanimoto distance measure:
> it will have a large intersection with many of the points, and not so large
> union with them
> (The issue is how much of these features to take into the cluster center 
> I figured the avg number of features in a point * some ratio)
>
> (I can state it more formally  this is just a short description of the
> idea...)
>
> When using the current kmeans implementation in Kmeans
> I got cluster centers with manymany features with very low weights (0.001
> etc.)  which didnt get me to clustering of binary measures as I wanted
> points were not divided well between clusters (all into one huge cluster)
>
> Maybe there is a mathematical meaning  but it doesnt logically solve the
> problem I'm trying to solve
> (cluster users to clusters by the documents they read  that is why it is
> binary)
>
> So I implemented myself kmeans for binary measures with the solution
> describes above
> and it worked much better  I got small cluster centers  binary (0/1 in
> the value)
> points where distributed well between clusters
>
>
> What do you think of this problem/solution?
>
> If more think this is an interesting direction I might want to contribute
> it into the Mahout library.
> Would it interest anyone?
>
> I think it focuses on a different way to create a cluster center for points
> when your distance measure deals with binary measures and is based on
> Jaccard Index / Tanimoto distance measure
>
>
>
>
> On Thu, Jun 21, 2012 at 1:03 PM, Yuval Feinstein <yuvalf@citypath.com>wrote:
>
>> Hi Shlomy.
>> According to the documentation:
>>
>> https://builds.apache.org/job/MahoutQuality/javadoc/org/apache/mahout/common/distance/TanimotoDistanceMeasure.html
>> The code uses the Tanimoto formula based on the inner multiple and norms.
>> Therefore, you get a distance value for every pair of vectors, even if the
>> cluster centroids have coordinates different than 0 and 1.
>> Of course, you can go and read the source code, which I have not done, and
>> further check this.
>> Or just run an experiment.
>> Cheers,
>> Yuval
>>
>>
>> On Thu, Jun 14, 2012 at 11:06 PM, Shlomy Boshy <shlomyb@outbrain.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > Im doing Kmeans clustering in Mahout using Tanimoto distance measure
>> >
>> > My input are feature vectors for which the indexes are the features and
>> the
>> > value is 1 for features that exist in the sample, and 0 for nonexisting
>> > features
>> > (it is actually clustering of users by documents they read, so for each
>> > user we have 1 in the documents that he read)
>> >
>> > So the input vectors are only 0 or 1
>> >
>> > By the output clusters are double values  not only 0 and 1
>> > and in the kmeans iterations I guess Kmeans move the cluster centers to
>> > various values for all features  not only 0 and 1
>> >
>> > So will the Tanimoto distance measure work in this case?
>> > I think it only gives the Jaccard Index when the values are 0 and 1
>> > (else it will not reflect the ratio between intersection and union of the
>> > features in the 2 points)
>> >
>> > If I add feature weights even more it will not be only 0 or 1 values
>> given
>> > to the distance measure
>> >
>> > So will TanimotoDistanceMeasure really work in KMeans clustering in
>> Hadoop?
>> >
>> > See this link for when Tanimoto is really a proper distance measure:
>> > http://en.wikipedia.org/wiki/Jaccard_index
>> >
>>
