mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Kmeans clusterdump Interpretation
Date Tue, 21 Jul 2015 01:40:31 GMT
The most central point in a cluster is often referred to as a medoid
(similar to a median, but multi-dimensional).

The Mahout code does not compute medoids.  In general, they are difficult
to compute, and implementing a full k-medoids clustering algorithm is even
more so.
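
For a cluster as small as the one discussed in this thread (14-15 articles), a brute-force medoid can be sketched outside Mahout. The following is only an illustration, not Mahout code: the class name MedoidSketch, its methods, and the sample points are all hypothetical, and it assumes each cluster member has already been exported as a dense double[] and that Euclidean distance is the measure of interest. It shows two options: the exact medoid (the member with the smallest total distance to all other members, O(n^2)), and a cheaper stand-in (the member nearest to the k-means centroid, O(n)).

```java
// Hypothetical helper, not part of Mahout: picks a representative member of a cluster.
final class MedoidSketch {

    /** Euclidean distance between two dense vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Exact medoid: index of the member with the smallest total distance to all members. */
    static int medoidIndex(double[][] points) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double cost = 0.0;
            for (int j = 0; j < points.length; j++) {
                cost += distance(points[i], points[j]);
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = i;
            }
        }
        return best;
    }

    /** Cheap approximation: index of the member closest to a given centroid. */
    static int nearestToCentroid(double[][] points, double[] centroid) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double d = distance(points[i], centroid);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up 2-D points standing in for the tf-idf vectors of one cluster.
        double[][] cluster = { {0.0, 0.0}, {1.0, 1.0}, {2.0, 0.0}, {10.0, 10.0} };
        // Arithmetic mean of the four points above, i.e. what k-means would report.
        double[] centroid = {3.25, 2.75};
        System.out.println("medoid index: " + medoidIndex(cluster));
        System.out.println("nearest to centroid: " + nearestToCentroid(cluster, centroid));
    }
}
```

The two answers need not agree in general. If the distance= value that clusterdump prints next to each point is its distance to the cluster center (an assumption here, not something confirmed in this thread), then the point with the smallest such value corresponds to the second method.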



On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel <ankitgoel2004@gmail.com> wrote:

> Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
> point central to a cluster. I guess in this case I would be looking for the
> most central point vector (from the index) that I can use as a
> representative of the cluster.
>
> On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <
> andrew.musselman@gmail.com> wrote:
>
> > I'm not sure centroid id is even a defined thing, especially since the
> > centroid, in my understanding, is just a point in space, not necessarily a
> > point in your data.
> >
> > Are you trying to find the most-central point in a given cluster?
> >
> > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel <ankitgoel2004@gmail.com>
> > wrote:
> >
> > > Hi,
> > > I've been messing with mahout 0.10 and kmeans clustering with a solr 4.6.1
> > > index. The data is news articles. The --field option for kmeans is set to
> > > "content". The idField is set to "title" (just so I can analyse it faster).
> > > The clusterdump of the kmeans result gives me a proper output, but I can't
> > > figure out the id of the vector chosen as the center. There are only 14-15
> > > articles so I am not hung up about the cluster performance at this time.
> > >
> > > I used random seeds for the kmeans commandline.
> > > For reference, this is the commandline cluster dump I am executing
> > >
> > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5
> > >
> > > The output I get is of the form
> > >
> > > :{"r":
> > >
> > > top terms
> > >
> > > xxxxx==>xxxxx
> > >
> > > Weight : [props - optional]:  Point:
> > >
> > >  1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > >
> > > 1.0 : [distance=0.3963903651622338]: [....]
> > >
> > >
> > > So how exactly do I get the centroid id? I have even tried accessing it
> > > with java
> > >
> > > ClusterWritable value.getValue().getCenter() but this just gives me the
> > > features and values of the centroid.
> > >
> > > Also, please do explain the meaning of "account":0.026 (just making sure I
> > > know it right). I used tfidf.
> > >
> > > --
> > > Regards,
> > > Ankit Goel
> > > http://about.me/ankitgoel
> > >
> >
>
>
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
>
