mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Kmeans clustering with Tanimoto distance measure in Mahout
Date Thu, 21 Jun 2012 14:30:12 GMT
Rather than reinvent the wheel here, I'd stick to more well-understood metrics.

I did my homework, and indeed the generalized Tanimoto distance is not
a distance metric: it can violate the triangle inequality. It would be
a metric if all values were 0 or 1. So, try rounding the vector
coordinates to 0 or 1. You have anecdotal evidence that this is an
improvement, and I still think there's a theoretical problem to be
fixed here, which such a change would also address.
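
To make the point concrete, here is a minimal sketch in plain Python (not
Mahout code). It assumes the usual definition of the generalized Tanimoto
distance, 1 - a.b / (|a|^2 + |b|^2 - a.b). The helper names and the 0.5
rounding threshold are my own choices for illustration.

```python
from itertools import product


def tanimoto_distance(a, b):
    """Generalized Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:  # both vectors are all zeros; treat them as identical
        return 0.0
    return 1.0 - dot / denom


def binarize(v, threshold=0.5):
    """Round each coordinate to 0 or 1 (the threshold is an assumption)."""
    return [1.0 if x >= threshold else 0.0 for x in v]


def triangle_holds(a, b, c):
    """Check d(a, c) <= d(a, b) + d(b, c), with a small float tolerance."""
    return tanimoto_distance(a, c) <= (
        tanimoto_distance(a, b) + tanimoto_distance(b, c) + 1e-12
    )


# Real-valued weights: d(a,b) = d(b,c) = 1/3, but d(a,c) = 9/13 > 2/3.
a, b, c = [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]
real_ok = triangle_holds(a, b, c)

# The same points after rounding to 0/1.
binary_ok = triangle_holds(binarize(a), binarize(b), binarize(c))

# Exhaustive check over every triple of 3-dimensional 0/1 vectors.
cube = [list(map(float, bits)) for bits in product([0, 1], repeat=3)]
all_binary_ok = all(
    triangle_holds(p, q, r) for p in cube for q in cube for r in cube
)

print(real_ok, binary_ok, all_binary_ok)
```

The real-valued triple breaks the triangle inequality, while the 0/1
vectors (where the measure reduces to the Jaccard distance) do not.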

On Thu, Jun 21, 2012 at 3:02 PM, Shlomy Boshy <shlomyb@outbrain.com> wrote:
> Thx Yuval
>
> I already implemented a different way to create a cluster center and it
> works great,
> but I wanted to use the existing Mahout implementation and not develop my
> own if possible...
>
> I think it is not the Tanimoto distance measure that gives such a centroid
> - I think it is the way a cluster center is created from points
> (which currently doesn't depend at all on the distance measure)
>
> It creates a center that is "mathematically legal" for k-means
> on real-valued metrics
> but doesn't make much sense for my problem,
> which is a discrete problem
>
> i.e. it does not do a good enough job of clustering users by the
> documents they have read in common
>
> [I'm not interested in the centroid itself (except for visualization and
> analysis, which are also important) - it is just that the clusters created
> seem logically wrong to me: most users were clustered into 1 or 2 clusters
> with many, many features - too many - with low weights. My new
> implementation didn't suffer from this problem]
>
> I will think about this some more
>
> Thx all for your help on this issue
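
The "many features with low weights" effect described above can be
reproduced with a small sketch. It assumes the cluster center is the
coordinate-wise mean of the member points, as in standard k-means. The
thread does not describe the replacement that was actually implemented,
so `majority_centroid` below is only a hypothetical alternative for
illustration, and the toy data is made up.

```python
def mean_centroid(points):
    """Coordinate-wise mean of the points (standard k-means center)."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def majority_centroid(points, min_share=0.5):
    """Hypothetical discrete center: keep a feature only if at least
    min_share of the points have it (an assumption, not the thread's method)."""
    return [1.0 if w >= min_share else 0.0 for w in mean_centroid(points)]


# Rows are users, columns are documents; 1 means the user read the document.
users = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
]

mean_c = mean_centroid(users)
maj_c = majority_centroid(users)

# Features that appear in the mean center with a weight below 0.5.
low_weight = [w for w in mean_c if 0.0 < w < 0.5]

print(mean_c)      # [1.0, 0.75, 0.25, 0.25, 0.25, 0.0]
print(maj_c)       # [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(low_weight)  # [0.25, 0.25, 0.25]
```

The mean center picks up every document read by any member, each with a
small fractional weight, whereas a thresholded center keeps only the
documents most members share.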
