mahout-user mailing list archives

From Ankit Goel <ankitgoel2...@gmail.com>
Subject Re: Kmeans clusterdump Interpretation
Date Tue, 21 Jul 2015 04:33:44 GMT
True that. Kmeans is just a first step anyway. Definitely needs tuning.
Thanks guys

On Tue, Jul 21, 2015 at 9:46 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> You can always just pick the article closest to the centroid.
>
> But I think that you may find that with simple k-means that clusters are
> going to be about more than one thing.
>
>
>
> On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel <ankitgoel2004@gmail.com>
> wrote:
>
> > Hmm, kmeans algorithmically is supposed to only anoint existing
> > vectors (documents) as the centroid for a cluster every step (or so I
> > believe). If mahout is generating a non-document vector as a centroid, it
> > changes a lot of things.
> >
> > That would also explain the -distanceMeasure option in clusterdump. As
> > Andrew mentions, running clusterdump with the default euclidean measure
> > should give me the closest document vector to the calculated centroid.
> > Please correct me if I'm wrong anywhere.
> > Thanks
> >
> > On Tue, Jul 21, 2015 at 7:33 AM, Andrew Musselman <
> > andrew.musselman@gmail.com> wrote:
> >
> > > It's possible you could write a post-processing step to find the closest
> > > point to the centroid based on the "distance" property if I'm recalling it
> > > correctly.
> > >
> > > On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel <ankitgoel2004@gmail.com>
> > > wrote:
> > >
> > > > That kind of puts me in a tough position. I was planning to use kmeans as a
> > > > method for aggregating similar articles from multiple news sources, and
> > > > then getting a representative article from those. Here I mean similar as in
> > > > the articles are from different news sources but are about the exact same
> > > > thing. Intuitively it seems that these articles would get grouped
> > > > together. Any suggestions how I should go about that? So far I'm using
> > > > nutch to crawl, solr to index and now I'm here on mahout.
> > > >
> > > > On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning <ted.dunning@gmail.com>
> > > > wrote:
> > > >
> > > > > The most central point in a cluster is often referred to as a medoid
> > > > > (similar to median, but multi-dimensional).
> > > > >
> > > > > The Mahout code does not compute medoids.  In general, they are difficult
> > > > > to compute and implementing a full k-medoid clustering algorithm even more
> > > > > so.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel <ankitgoel2004@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Oh, I thought kmeans gave me a point vector as a centroid, not a calculated
> > > > > > point central to a cluster. I guess in this case I would be looking for the
> > > > > > most central point vector (from the index) that I can use as a
> > > > > > representative of the cluster.
> > > > > >
> > > > > > On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <
> > > > > > andrew.musselman@gmail.com> wrote:
> > > > > >
> > > > > > > I'm not sure centroid id is even a defined thing, especially since the
> > > > > > > centroid, in my understanding, is just a point in space, not necessarily a
> > > > > > > point in your data.
> > > > > > >
> > > > > > > Are you trying to find the most-central point in a given cluster?
> > > > > > >
> > > > > > > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel <ankitgoel2004@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > I've been messing with mahout 0.10 and kmeans clustering with a solr 4.6.1
> > > > > > > > index. The data is news articles. The --field option for kmeans is set to
> > > > > > > > "content". The idField is set to "title" (just so I can analyse it faster).
> > > > > > > > The clusterdump of the kmeans result gives me a proper output, but I can't
> > > > > > > > figure out the id of the vector chosen as the center. There are only 14-15
> > > > > > > > articles so I am not hung up about the cluster performance at this time.
> > > > > > > >
> > > > > > > > I used random seeds for the kmeans commandline.
> > > > > > > > For reference, this is the commandline cluster dump I am executing
> > > > > > > >
> > > > > > > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > > > > > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5
> > > > > > > >
> > > > > > > > The output I get is of the form
> > > > > > > >
> > > > > > > > :{"r":
> > > > > > > >
> > > > > > > > top terms
> > > > > > > >
> > > > > > > > xxxxx==>xxxxx
> > > > > > > >
> > > > > > > > Weight : [props - optional]:  Point:
> > > > > > > >
> > > > > > > >  1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > > > > > > >
> > > > > > > > 1.0 : [distance=0.3963903651622338]: [....]
> > > > > > > >
> > > > > > > >
> > > > > > > > So how exactly do I get the centroid id? I have even tried accessing it
> > > > > > > > with java
> > > > > > > >
> > > > > > > > ClusterWritable value.getValue().getCenter() but this just gives me the
> > > > > > > > features and values of the centroid.
> > > > > > > >
> > > > > > > > Also, please do explain the meaning of "account":0.026 (just making sure I
> > > > > > > > know it right). I used tfidf.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > > Ankit Goel
> > > > > > > > http://about.me/ankitgoel
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Ankit Goel
> > > > > > http://about.me/ankitgoel
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Ankit Goel
> > > > http://about.me/ankitgoel
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Ankit Goel
> > http://about.me/ankitgoel
> >
>



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
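
For anyone reading this thread later: the two points made above (a k-means centroid is a computed mean rather than one of the input documents, and a representative article can be picked as the document closest to that centroid) can be illustrated with a small sketch. This is plain Python and does not use any Mahout API; the function names, the document ids and the toy two-dimensional vectors are all made up for illustration, and real tf-idf vectors would be sparse and far higher-dimensional.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length dense vectors."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def euclidean(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_to_centroid(docs):
    """Return (doc_id, distance) for the document nearest the cluster mean.

    docs maps a document id to its vector. This only approximates a medoid:
    it picks the point nearest the mean, not the point that minimises the
    total distance to all other points.
    """
    c = centroid(list(docs.values()))
    best = min(docs, key=lambda doc_id: euclidean(docs[doc_id], c))
    return best, euclidean(docs[best], c)


# Hypothetical cluster of three documents (ids and values are illustrative).
cluster = {
    "article-a": [0.0, 0.0],
    "article-b": [4.0, 0.0],
    "article-c": [2.0, 3.0],
}

# The mean is [2.0, 1.0], which is not any of the three documents.
print(centroid(list(cluster.values())))
# The representative is the document with the smallest distance to that mean.
print(closest_to_centroid(cluster))
```

In this toy cluster the centroid coincides with none of the documents, which is why there is no "centroid id" to look up. The same selection could in principle be done as a post-processing step over the per-point distance values that clusterdump prints, as suggested above.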
