mahout-user mailing list archives

From: Viral Parikh
Subject: Few Questions related Mahout used for Text Clustering
Date: Wed, 03 Dec 2014 23:40:55 GMT
Hi Mahout Users!

Firstly, this community is great, and I appreciate all the Q & A back and
forth.

I am currently working on text clustering, using Mahout and its
clustering algorithms (kmeans, krunner, canopy, etc.).

If anyone has worked on a similar project, please let me know. I have two
questions:

1. To choose an optimal k, I am running krunner across my vectorized
dataset. For each k, I look at how my observations are spread across the
clusters and try to minimize the size of cluster 1 (which appears to be the
catch-all bucket; can anyone confirm?). However, the final count varies
depending on k. See below (please ignore the blank cells):

Any idea why the final count varies depending on chosen k?

 [image: Inline image 1]

2. I also noticed that some of my clusters contain just one observation
(n=1). That doesn’t make sense to me. Is there a way to avoid this, for
example a particular parameter I can tweak?
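
To make both questions concrete, here is a minimal sketch of the tally being
described. It is plain Python, not Mahout code, and it assumes the cluster
assignments have already been exported as one cluster id per observation. The
function names and the `assignments_by_k` values are hypothetical, made up
purely for illustration.

```python
from collections import Counter


def cluster_sizes(assignments):
    """Count how many observations landed in each cluster id."""
    return Counter(assignments)


def summarize(assignments_by_k):
    """For each k, report the total count, the share held by the largest
    cluster, and the ids of clusters that contain exactly one observation."""
    summary = {}
    for k, assignments in sorted(assignments_by_k.items()):
        sizes = cluster_sizes(assignments)
        total = sum(sizes.values())
        summary[k] = {
            "total": total,
            "largest_share": max(sizes.values()) / total,
            "singletons": sorted(c for c, n in sizes.items() if n == 1),
        }
    return summary


# Hypothetical assignments (one cluster id per observation) for two runs.
assignments_by_k = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}

summary = summarize(assignments_by_k)
for k, row in summary.items():
    print(k, row)
```

If every observation is assigned to exactly one cluster, the "total" value
should be identical for every k. Totals that differ would suggest that some
observations are being dropped or counted more than once somewhere in the
counting step, rather than anything about k itself.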

Thank you and looking forward to your reply.


