mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shlomy Boshy <shlo...@outbrain.com>
Subject Re: Kmeans clustering with Tanimoto distance measure in Mahout
Date Thu, 21 Jun 2012 11:36:38 GMT
Interesting - to which thresholds do you refer?



On Thu, Jun 21, 2012 at 2:24 PM, Sean Owen <srowen@gmail.com> wrote:

> Granted I too may be missing something since I am not familiar with
> the code so much, but...
>
> The Tanimoto "distance" isn't a proper distance metric is it? not when
> defined over real-valued vectors like it is here. That seems like the
> root issue here. I'm pretty sure we need a distance metric in
> clustering right?
>
> It's not that the Tanimoto distance isn't defined on these vectors --
> it is, or can be, and is in Mahout. It's also not a problem to take
> the centroid as a mean of vectors; that's valid.
>
> The problem you describe seems like a matter of not setting the
> clustering thresholds quite right.
>
>
> On Thu, Jun 21, 2012 at 11:39 AM, Shlomy Boshy <shlomyb@outbrain.com>
> wrote:
> > Yuval - Thx for your reply
> >
> > It works technically but I believe it doesnt give good logical results -
> I
> > think it has a logical issue when it comes to this distance measure
> >
> > The problem is that the current kmeans optimization in Mahout
> > does connect points to clusters by the distance measure - so I can use a
> > distance measure like Tanimoto which handles well binary measures
> (Jaccard
> > Index etc.)
> > BUT for making new cluster centers
> > it *always *uses an arithmetic average in each feature
> > (i.e. sum of values in the measure X in all points belonging to the
> > cluster, divided by the number of points)
> > for any distance measure
> > [I read the code - I hope my understanding was right]
> >
> > This is good for Euclidean distance
> >
> > but for binary measures / Jaccard index distance function
> > I believe this doesnt create a cluster center well.
> >
> > A cluster center should be a point "close" to many of the points in the
> > cluster by the distance measure
> >
> > Also by the logical target of the Tanimoto measure it needs to "close" in
> > the sense of high intersection with the points, and low union (Jaccard
> > Index).
> >
> > If I use Jaccard index / Tanimoto as the distance measure
> > It think I want a cluster center which reflects the following property:
> > the features in the cluster center which are non-zero should be the
> > features appearing in most of the points in the cluster
> > (i.e. if we order the features by their occurences in points in the
> cluster
> > we want the top ones in the cluser center)
> >
> > This way the cluster center will become "close" to most points in the
> > cluster in the Tanimoto distance measure:
> > it will have a large intersection with many of the points, and not so
> large
> > union with them
> > (The issue is how much of these features to take into the cluster center
>  -
> > I figured the avg number of features in a point * some ratio)
> >
> > (I can state it more formally - this is just a short description of the
> > idea...)
> >
> > When using the current kmeans implementation in Kmeans
> > I got cluster centers with many-many features with very low weights
> (0.001
> > etc.) - which didnt get me to clustering of binary measures as I wanted
> > points were not divided well between clusters (all into one huge cluster)
> >
> > Maybe there is a mathematical meaning - but it doesnt logically solve the
> > problem I'm trying to solve
> > (cluster users to clusters by the documents they read - that is why it is
> > binary)
> >
> > So I implemented myself kmeans for binary measures with the solution
> > describes above
> > and it worked much better - I got small cluster centers - binary (0/1 in
> > the value)
> > points where distributed well between clusters
> >
> >
> > What do you think of this problem/solution?
> >
> > If more think this is an interesting direction I might want to contribute
> > it into the Mahout library.
> > Would it interest anyone?
> >
> > I think it focuses on a different way to create a cluster center for
> points
> > when your distance measure deals with binary measures and is based on
> > Jaccard Index / Tanimoto distance measure
> >
> >
> >
> >
> > On Thu, Jun 21, 2012 at 1:03 PM, Yuval Feinstein <yuvalf@citypath.com
> >wrote:
> >
> >> Hi Shlomy.
> >> According to the documentation:
> >>
> >>
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/distance/TanimotoDistanceMeasure.html
> >> The code uses the Tanimoto formula based on the inner multiple and
> norms.
> >> Therefore, you get a distance value for every pair of  vectors, even if
> the
> >> cluster centroids have coordinates different than 0 and 1.
> >> Of course, you can go and read the source code, which I have not done,
> and
> >> further check this.
> >> Or just run an experiment.
> >> Cheers,
> >> Yuval
> >>
> >>
> >> On Thu, Jun 14, 2012 at 11:06 PM, Shlomy Boshy <shlomyb@outbrain.com>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Im doing Kmeans clustering in Mahout using Tanimoto distance measure
> >> >
> >> > My input are feature vectors for which the indexes are the features
> and
> >> the
> >> > value is 1 for features that exist in the sample, and 0 for
> non-existing
> >> > features
> >> > (it is actually clustering of users by documents they read, so for
> each
> >> > user we have 1 in the documents that he read)
> >> >
> >> > So the input vectors are only 0 or 1
> >> >
> >> > By the output clusters are double values - not only 0 and 1
> >> > and in the kmeans iterations I guess Kmeans move the cluster centers
> to
> >> > various values for all features - not only 0 and 1
> >> >
> >> > So will the Tanimoto distance measure work in this case?
> >> > I think it only gives the Jaccard Index when the values are 0 and 1
> >> > (else it will not reflect the ratio between intersection and union of
> the
> >> > features in the 2 points)
> >> >
> >> > If I add feature weights even more it will not be only 0 or 1 values
> >> given
> >> > to the distance measure
> >> >
> >> > So will TanimotoDistanceMeasure really work in KMeans clustering in
> >> Hadoop?
> >> >
> >> > See this link for when Tanimoto is really a proper distance measure:
> >> > http://en.wikipedia.org/wiki/Jaccard_index
> >> >
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message