mahout-user mailing list archives

From Ankit Goel
Subject Kmeans clusterdump Interpretation
Date Tue, 21 Jul 2015 00:18:57 GMT
I've been experimenting with Mahout 0.10 and kmeans clustering on a Solr 4.6.1
index. The data is news articles. The --field option for kmeans is set to
"content". The idField is set to "title" (just so I can analyse it faster).
The clusterdump of the kmeans result gives me proper output, but I can't
figure out the id of the vector chosen as the center. There are only 14-15
articles, so I am not hung up on cluster performance at this time.

I used random seeds on the kmeans command line.
For reference, this is the clusterdump command I am executing:

bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
-p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5

The output I get is of the form


top terms


Weight : [props - optional]:  Point:

 1.0 : [distance=0.0]: [{"account":0.026}.......other features]

1.0 : [distance=0.3963903651622338]: [....]

So how exactly do I get the centroid id? I have even tried accessing it
from Java with

ClusterWritable value.getValue().getCenter()

but this just gives me the features and values of the centroid.
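
For illustration only, here is a minimal sketch in plain Python (not Mahout code) of why a centroid may have no id of its own. It assumes the usual k-means definition, where the center is the component-wise mean of the member vectors, so the closest thing to a "centroid id" is the id of the member nearest to the center. The titles and vectors are made up, and the helper names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length dense vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_to_center(points, center):
    """Return (id, distance) of the member closest to the center."""
    return min(
        ((pid, euclidean(vec, center)) for pid, vec in points.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical cluster: ids stand in for article titles.
cluster = {
    "title-a": [0.0, 0.0],
    "title-b": [1.0, 0.0],
    "title-c": [5.0, 3.0],
}

center = centroid(list(cluster.values()))      # the mean, not an input vector
best_id, best_dist = nearest_to_center(cluster, center)
print(center, best_id, best_dist)
```

Under this definition, a point listed with distance=0.0 coincides with the center, which is what happens in a cluster with a single member.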

Also, please explain the meaning of "account":0.026 (just making sure I
understand it correctly). I used tf-idf.
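
As a rough sketch of how such a term:weight pair can be read, the plain-Python example below computes tf-idf weights for a tiny made-up corpus. It assumes the textbook formula tf * log(N / df). The exact weighting and normalization Mahout applies may differ, so the numbers are not comparable to 0.026. The function name and documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized articles.
docs = [
    ["account", "bank", "bank", "loan"],
    ["bank", "rate"],
    ["match", "goal"],
]

weights = tfidf(docs)
print(weights[0])
```

Here "account" appears once in the first document but in no other document, so it outweighs "bank", which is more frequent locally but also occurs elsewhere. A pair such as "account":0.026 would be read the same way: the weight of the term "account" in that vector.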

Ankit Goel
