mahout-user mailing list archives

From Jesvin Jose <>
Subject Why are clustering emails not clustering similar stuff?
Date Thu, 06 Jun 2013 05:47:52 GMT
I tried to cluster 1000 emails belonging to one person using k-means, but the
clusters are not forming well. For example, if Facebook sends notifications
about James Doe and 5 other people, I get 5 clusters like:

    Top Terms:
        doe                                   =>  10.066998481750488
        james                                =>  10.066998481750488

Why are the notifications for all 5 people not getting clustered together? I
used variants of the commands from Mahout in Action (Sean Owen et al.), as
shown below.

Vectorizing uses lowercasing, stop-word removal, and a length filter:

bin/hadoop jar \
  org.apache.mahout.driver.MahoutDriver seq2sparse \
  -i mymail-seqfiles -o mymail-vectors-bigram -ow \
  -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt tfidf \
  -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq
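
For illustration, here is a minimal Python sketch (not Mahout code) of how TF-IDF weighting treats notification-style emails. It assumes the textbook weight tf * ln(N / df); Mahout's own formula may differ, and the toy corpus and the `tfidf` helper are hypothetical. Under that assumption, terms shared by every email get a weight of zero, so the only terms left with weight are the names, which may be related to the top terms shown above.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes the textbook weight tf * ln(N / df). Mahout's TF-IDF
    implementation may use a different formula, so the numbers here
    are illustrative only.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical toy corpus: three notification-style emails.
docs = [
    ["facebook", "notification", "james", "doe"],
    ["facebook", "notification", "jane", "roe"],
    ["facebook", "notification", "john", "smith"],
]
vectors = tfidf(docs)

for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

In this toy corpus "facebook" and "notification" occur in every document, so their weight is ln(3/3) = 0, while each name occurs in one document and gets ln(3).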

This is for 1000 emails, and I asked for 100 clusters. With 50, I still get
similar results, but only half as many emails "get into" any cluster.

bin/hadoop jar \
  org.apache.mahout.driver.MahoutDriver kmeans \
  -i mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters \
  -o mymail-kmeans-clusters-from-bigrams \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -k 100 -x 20 -cl
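
As a rough illustration of what the cosine distance measure does with such vectors, here is a small Python sketch (not Mahout code). It assumes cosine distance is one minus cosine similarity; the `cosine_distance` helper and all weights are hypothetical, chosen so that the name terms dominate the way the top terms above do.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch only
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical weights: shared terms are light, name terms are heavy.
james = {"facebook": 1.0, "notification": 1.0, "james": 10.0, "doe": 10.0}
jane = {"facebook": 1.0, "notification": 1.0, "jane": 10.0, "roe": 10.0}
james_again = {"facebook": 1.0, "notification": 1.0,
               "james": 10.0, "doe": 10.0, "commented": 2.0}

# Different names: the heavy terms do not overlap, so the distance is near 1.
d_different = cosine_distance(james, jane)
# Same name: the heavy terms overlap, so the distance is near 0.
d_same = cosine_distance(james, james_again)

print(d_different, d_same)
```

With weights like these, two notifications about different people are almost as far apart as unrelated emails, while two notifications about the same person are close together.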

We don't beat the reaper by living longer. We beat the reaper by living well
and living fully. The reaper will come for all of us. Question is, what do
we do between the time we are born and the time he shows up? -Randy Pausch
