mahout-user mailing list archives

From Viral Parikh <viral.j.par...@gmail.com>
Subject Few Questions related Mahout used for Text Clustering
Date Wed, 03 Dec 2014 23:40:55 GMT
Hi Mahout Users!



First, this community is great, and I appreciate all the Q&A back and
forth!



I am currently working on text clustering, using Mahout and its
clustering algorithms (kmeans, krunner, canopy, etc.).



If anyone has worked on a similar project, please let me know. I have two
questions:



1. To choose the optimal k, I am running krunner across my vectorized
dataset. To pick the right k, I am looking at how my observations spread
across the clusters and trying to minimize the size of cluster 1 (which
appears to be the catch-all bucket; can anyone confirm?). However, the
final count varies depending on k. See the table below (please ignore the
blank cells):



Any idea why the final count varies depending on the chosen k?



 [image: Inline image 1]
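
If "final count" here means the total number of observations assigned
across all clusters, the expectation in question 1 can be sketched as
below. This is a minimal Python illustration, not Mahout code: the `runs`
dictionary and its cluster labels are made-up toy data, and the function
names are hypothetical. It assumes hard assignments, where each
observation lands in exactly one cluster, so the per-cluster counts
should sum to the same total for every k.

```python
from collections import Counter


def cluster_sizes(assignments):
    """Return {cluster_id: number of observations} for one clustering run."""
    return dict(Counter(assignments))


def totals_by_k(runs):
    """Map each k to the total number of assigned observations.

    `runs` is a hypothetical dict {k: list of cluster ids, one per observation}.
    """
    return {k: sum(cluster_sizes(a).values()) for k, a in runs.items()}


# Toy data (assumption): six observations clustered with k=2 and k=3.
runs = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}

totals = totals_by_k(runs)

# With hard assignments, every k should account for all six observations.
consistent = len(set(totals.values())) == 1
```

If `consistent` comes out false on real output, some observations are
being dropped or counted more than once for at least one value of k,
which narrows down where to look.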



2. I also noticed that some of my clusters contain just one observation
(n=1), which doesn't make sense to me. Is there a way to avoid this, for
example a particular parameter I can tweak?
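
To make question 2 concrete, singleton clusters can be flagged from a
table of cluster sizes as in the sketch below. It is a small Python
illustration with made-up sizes, not Mahout code. The `small_clusters`
helper and its `min_size` threshold are hypothetical names introduced
only for this example.

```python
def small_clusters(sizes, min_size=2):
    """Return ids of clusters holding fewer than `min_size` observations."""
    return sorted(cid for cid, n in sizes.items() if n < min_size)


# Toy cluster sizes (assumption): cluster id -> number of observations.
sizes = {0: 120, 1: 1, 2: 45, 3: 1}

# Clusters 1 and 3 each hold a single observation.
singletons = small_clusters(sizes)
```

Listing the flagged ids per value of k shows whether the singletons are
the same few observations every time, which would point to outliers
rather than to the choice of k.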



Thank you. I look forward to your reply.





Cheers,

Viral
