mahout-user mailing list archives

From Kate Ericson
Subject Re: question about clustering
Date Mon, 03 Oct 2011 15:04:07 GMT
Hi Weide,

As a disclaimer, I only know enough to try to help you figure out your
first problem.
First of all, can you tell us about the dataset you are using?
How many points are you clustering?

As a guess without knowing either of these things, part of the reason
your clusters look the same is that you're only clustering around
3 points. The run only went for 2 iterations, so it looks like it's
just not moving your cluster centers around at all. Can you try again
with a larger k?
That may let it run for more iterations, so you should be able to see
more change in the results.
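
To illustrate why a run can stop after only a couple of iterations: a
generic k-means loop ends as soon as no center moves farther than the
convergence delta, whatever the iteration cap is. Here is a minimal
sketch in plain Python. It is not Mahout's implementation; the function
names and the sample points are made up for illustration, and I'm
assuming Euclidean distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centers, max_iter, convergence_delta):
    """Generic Lloyd-style k-means (illustrative, not Mahout's code).

    Returns (final_centers, iterations_run). The loop stops early once
    no center has moved more than convergence_delta in an iteration.
    """
    centers = [list(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(m[d] for m in members) / len(members)
                                    for d in range(len(old))])
            else:
                new_centers.append(old)  # empty cluster keeps its center

        largest_shift = max(euclidean(o, n)
                            for o, n in zip(centers, new_centers))
        centers = new_centers
        if largest_shift <= convergence_delta:
            return centers, iteration
    return centers, max_iter


# Made-up sample data: two well-separated groups, with both starting
# centers deliberately placed inside the first group.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
start = [(0, 0), (1, 1)]

tight_centers, tight_iters = kmeans(points, start, 1000, 0.001)
loose_centers, loose_iters = kmeans(points, start, 1000, 11.0)
print("tight delta:", tight_iters, "iterations", tight_centers)
print("loose delta:", loose_iters, "iterations", loose_centers)
```

With these sample points the tight delta needs 3 iterations to settle
on the two group centers, while the loose delta stops after 1 iteration
with centers that are still off, even though the cap is 1000 in both
cases.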

Good luck!


On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang wrote:
> Hi,
> I have used Mahout to produce a kmeans clustering of my tf-idf result. I
> used the mahout command line to produce the clusters, and it seems to
> complete successfully.
> $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
> It seems two cluster directories were generated (cluster-1 and
> cluster-2). When I use clusterdump on each of them, the top terms of the
> clusters look the same to me. Any idea why?
> Also, how can I see which documents have been assigned to each cluster?
> Right now, I can see the number of documents assigned but not the complete
> list.
> Most importantly, for production purposes, I assume it makes sense for
> kmeans to always run on Hadoop to generate the clustering file. But how do
> I consume the output during serving? Ideally, serving should take a doc id
> or a query as input, and the server should return the top documents in the
> same cluster, ranked by score. How do I do that in code?
> Any good examples?
> Thanks a lot,
> Weide
