mahout-user mailing list archives

From 万代豊 <>
Subject Re: Mahout clustering
Date Wed, 17 Dec 2014 07:47:16 GMT
Hi Shweta,
I think I can help with this.
I always specify the --namedVector option when generating term vectors
(seq2sparse), so that each vector carries its document ID:

$MAHOUT_HOME/bin/mahout seq2sparse --namedVector \
  -i MyJob/MyJob-seqfile/ -o MyJob/MyJob-namedVector -ow \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer \
  -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n

Then I run k-means on these named vectors, like this:

$MAHOUT_HOME/bin/mahout kmeans \
  -i MyJob/MyJob-namedVector/tfidf-vectors/ \
  -c MyJob/MyJob-initial-namedVector-clusters \
  -o MyJob/MyJob-kmeans-namedVector-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.01 -k 12 -x 20 -cl

Then dump the result to a text file (use the clusters-N-final directory
that your own run produced; mine was clusters-8-final):
$MAHOUT_HOME/bin/mahout clusterdump \
  --pointsDir MyJob/MyJob-kmeans-namedVector-clusters/clusteredPoints \
  -dt sequencefile -d MyJob/MyJob-namedVector/dictionary.file-* \
  -i MyJob/MyJob-kmeans-namedVector-clusters/clusters-8-final \
  -o /home/hadoop/MyJob/MyJob-kmeans-namedVector-clusterdump01.txt \
  -b 100 -n 20

You should then see all the cluster information in that file: the cluster
ID, the number of documents in the cluster, the ID of each document in
the cluster, the top terms, and so on.
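
If you want the document-to-cluster mapping as data rather than reading
the dump by eye, a small script along these lines might help. It is only
a sketch: it assumes each cluster starts with a line like "VL-123{...}"
(or "CL-123{...}") and each member document appears on an indented line
like "1.0: /docname = [...]". That layout may differ between Mahout
versions, so check it against your own dump first. The function name and
the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: cluster header lines look like "VL-123{n=..." or "CL-123{n=...",
# optionally preceded by a colon. Verify against your own clusterdump output.
CLUSTER_RE = re.compile(r"^:?(?:VL|CL)-(\d+)\{")

# ASSUMPTION: point lines are indented and look like
#   "1.0: /docname = [..."            or
#   "1.0 : [distance=0.5]: /docname = [..."
POINT_RE = re.compile(r"^\s+[\d.]+\s*:\s*(?:\[.*?\]\s*:\s*)?(\S+)\s*=\s*\[")


def docs_by_cluster(lines):
    """Return {cluster_id: [document names]} from clusterdump text lines."""
    clusters = defaultdict(list)
    current = None  # ID of the cluster whose block we are inside
    for line in lines:
        header = CLUSTER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        point = POINT_RE.match(line)
        if point and current is not None:
            clusters[current].append(point.group(1))
    return dict(clusters)


# Hypothetical sample, shaped like the assumed dump layout (not real output).
SAMPLE_LINES = [
    "VL-3{n=2 c=[0.1, 0.2] r=[0.01, 0.02]}",
    "\tTop Terms: ",
    "\t\thadoop                                  =>  1.25",
    "\tWeight : [props - optional]:  Point:",
    "\t1.0: /doc-a.txt = [12:0.500, 40:0.300]",
    "\t1.0: /doc-b.txt = [12:0.200]",
    "VL-7{n=1 c=[0.3] r=[]}",
    "\tTop Terms: ",
    "\t\tmahout                                  =>  2.10",
    "\tWeight : [props - optional]:  Point:",
    "\t1.0 : [distance=0.5]: /doc-c.txt = [7:0.900]",
]

if __name__ == "__main__":
    for cluster_id, docs in docs_by_cluster(SAMPLE_LINES).items():
        print(cluster_id, docs)
```

To use it on a real dump, pass the open file instead of SAMPLE_LINES,
for example docs_by_cluster(open("MyJob-kmeans-namedVector-clusterdump01.txt")).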

* Note that this example is from Mahout 0.7.
Give it a try.
Good luck.

2014-12-08 14:39 GMT+09:00 shweta agrawal <>:
> Hello,
> I am new to mahout. I am working on mahout clustering to detect topic. I
> have done mahout kmeans clustering and i got the top terms of cluster also,
> but i want the document id of the clusters. How to get which document is in
> which cluster?
> Thanks and Regards
> Shweta
