mahout-user mailing list archives

From Walter Chang <>
Subject question about clustering
Date Mon, 03 Oct 2011 03:52:58 GMT
Hi,

I have used Mahout to produce a k-means clustering of my tf-idf vectors. I ran
the clustering from the mahout command line, and it seems to have completed
successfully:

$MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters \
    -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000

Two cluster directories seem to have been generated (cluster-1 and
cluster-2). When I run clusterdump on each of them, the top terms reported
for the clusters appear to be the same in both. Any idea why?

Also, how can I see which documents have been assigned to each cluster?
Right now I can see the number of documents assigned, but not the complete
list.

Most importantly, for production purposes, I assume it makes sense for
k-means to always run on Hadoop to generate the clustering output. But how do
I consume that output at serving time? Ideally, the server would receive a
doc id or a query, and return the top documents from the same cluster,
ranked by score. How do I do this in code? Are there any good examples?
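
To make the serving question concrete, here is a minimal sketch of the lookup
described above. It is not Mahout code and does not use any Mahout API: it
assumes the cluster assignments have already been exported from the Hadoop job
into a plain mapping of doc id to (cluster id, score). The data, the function
names, and the meaning of "score" (here just a per-document number where
higher is better) are all hypothetical.

```python
from collections import defaultdict


def build_cluster_index(assignments):
    """Group doc ids by cluster id, best score first.

    `assignments` is a hypothetical export of the clustering result:
    doc_id -> (cluster_id, score).
    """
    index = defaultdict(list)
    for doc_id, (cluster_id, score) in assignments.items():
        index[cluster_id].append((score, doc_id))
    for members in index.values():
        # Descending score; ties broken by doc id for a stable order.
        members.sort(key=lambda m: (-m[0], m[1]))
    return dict(index)


def top_in_same_cluster(doc_id, assignments, index, n=3):
    """Return up to n other doc ids from doc_id's cluster, ranked by score."""
    cluster_id, _ = assignments[doc_id]
    return [d for _, d in index[cluster_id] if d != doc_id][:n]


# Hypothetical assignments, standing in for the real clustering output.
ASSIGNMENTS = {
    "doc1": (0, 0.91),
    "doc2": (0, 0.75),
    "doc3": (1, 0.88),
    "doc4": (0, 0.60),
    "doc5": (1, 0.42),
}
INDEX = build_cluster_index(ASSIGNMENTS)

print(top_in_same_cluster("doc2", ASSIGNMENTS, INDEX, n=2))
```

The index is built once, offline or at server start-up, so each request is
only a dictionary lookup plus a slice. Handling a free-text query instead of a
doc id would first need the query mapped to a cluster, which this sketch does
not attempt.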

Thanks a lot,

