mahout-user mailing list archives

From Ramiro
Subject Clustering points with their point id
Date Wed, 21 May 2014 07:50:08 GMT

I have been trying to use Mahout k-means to cluster some synthetic
data (the A3 set) as a test reference, but it seems I'm missing
something. I'm quite a beginner, so please bear with me. Before the
explanations, what I want to know is:

A) Is there a better way to cluster a series of numeric points and
keep their id values?
B) What am I doing wrong with this approach?

-Thanks for your patience

This is a sample of the data:


As you can see, it is in the format: id_point, value1, value2 (the comma
doesn't matter; I could use tab separators instead).
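To make the format concrete, here is a minimal sketch in plain Java of how one such line splits into an id and its coordinates. The class name `PointLineParser`, the sample line, and the assumption that the id sits in the first column (`INDEX_ID = 0`) are illustrative only and are not taken from my actual data file.

```java
import java.util.Arrays;

class PointLineParser {
    // Assumption: the id is the first column of every line.
    static final int INDEX_ID = 0;

    // Returns the id column, with surrounding whitespace removed.
    static String parseId(String line, String separator) {
        return line.split(separator)[INDEX_ID].trim();
    }

    // Returns every column after the id, parsed as a double.
    static double[] parseCoordinates(String line, String separator) {
        String[] tokens = line.split(separator);
        double[] coordinates = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            coordinates[i - 1] = Double.parseDouble(tokens[i].trim());
        }
        return coordinates;
    }

    public static void main(String[] args) {
        // Illustrative line only; the separator is passed in, so a tab works the same way.
        String line = "point1, 53920, 42968";
        System.out.println(parseId(line, ",") + " -> "
                + Arrays.toString(parseCoordinates(line, ",")));
    }
}
```

Passing the separator as a parameter is what makes the comma-versus-tab choice irrelevant: the same code works with `"\t"`.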

None of the numeric test data I have seen (as opposed to the Reuters
example) has an id column to identify each point after the clusterdump,
just the values. Is there a way to map the points to an ID from the
command line?

On Stack Overflow I saw a code sample that writes its own sequence file
using NamedVectors to capture the ID. I created a similar version (code
at the end), and it creates a SequenceFile in HDFS:


hadoop fs -text  /tmp/mahout-another/input/testSequence/ | head
point1  point1:{0:53920.0,1:42968.0}
point2  point2:{0:52019.0,1:42206.0}
point3  point3:{0:52570.0,1:42476.0}
point4  point4:{0:54220.0,1:42081.0}
point5  point5:{0:54268.0,1:43420.0}
point6  point6:{0:52288.0,1:42408.0}
point7  point7:{0:54436.0,1:39727.0}
point8  point8:{0:52391.0,1:44323.0}
point9  point9:{0:54995.0,1:43655.0}
point10 point10:{0:53761.0,1:43403.0}
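
For reference, a line of this dump can be split back into its name and coordinates with a few lines of plain Java. This is only a sketch: the class and method names are made up, and it assumes the exact `name:{index:value,...}` layout shown above, names without whitespace, and every index present (a dense vector).

```java
class DumpLineParser {
    // Extracts the vector name from the value part of a dump line,
    // e.g. "point1" from "point1  point1:{0:53920.0,1:42968.0}".
    static String parseName(String dumpLine) {
        String value = dumpLine.trim().split("\\s+", 2)[1];
        return value.substring(0, value.indexOf(":{"));
    }

    // Extracts the coordinates, placing each value at the index written before it.
    static double[] parseValues(String dumpLine) {
        String body = dumpLine.substring(dumpLine.indexOf('{') + 1,
                                         dumpLine.lastIndexOf('}'));
        String[] entries = body.split(",");
        double[] values = new double[entries.length];
        for (String entry : entries) {
            String[] pair = entry.split(":");
            values[Integer.parseInt(pair[0])] = Double.parseDouble(pair[1]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "point1  point1:{0:53920.0,1:42968.0}";
        double[] values = parseValues(line);
        System.out.println(parseName(line) + " " + values[0] + " " + values[1]);
    }
}
```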

Then I cluster with:

mahout kmeans -k 50 -i /tmp/mahout-another/input/testSequence/ \
  -o /tmp/mahout-another/output -c /tmp/mahout-initial-clusters --maxIter 10

After it completes, I run clusterdump with:

mahout clusterdump -i /tmp/mahout-another/output/clusters-10-final \
  -o output.txt -p /tmp/mahout-another/input/testSequence

which throws:

Exception in thread "main" java.lang.ClassCastException: cannot be cast to

I'm guessing that the problem lies in the SequenceFile writer (code
below), since it requires two class parameters; following the Stack
Overflow example, I write the name of the vector as the Text key and
the vector itself as the value. Clusterdump then tries to read the
points, finds Text, and throws an exception. But if that's the case,
how should I write the file instead?
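
To make that guess concrete, here is a minimal sketch of the general mechanism in plain Java. It involves no Mahout or Hadoop classes and I am not claiming this is what clusterdump does internally: `KeyTypeMismatchDemo` and `readKey` are hypothetical, a `String` stands in for the Text key, and `Integer` is an arbitrarily chosen "expected" key type.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class KeyTypeMismatchDemo {
    // Hypothetical reader that assumes every record key is an Integer.
    static int readKey(Map.Entry<Object, Object> record) {
        return (Integer) record.getKey(); // fails at runtime for any other key type
    }

    // Builds a record keyed by a String (standing in for Text) and reports
    // whether the reader above rejects it with a ClassCastException.
    static boolean failsOnTextKey() {
        Map.Entry<Object, Object> record =
                new SimpleEntry<Object, Object>("point1", new double[] {53920.0, 42968.0});
        try {
            readKey(record);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("cast failed: " + failsOnTextKey());
    }
}
```

If something like this is happening, the file itself is valid and the mismatch is between what was written and what the reader expects to find at that path.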

Code for the sequence writer with NamedVectors:


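import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Inside main: args[0] = input file, args[1] = output path, args[2] = separator
BufferedReader br = new BufferedReader(new FileReader(args[0]));
String line;
List<NamedVector> vector = new ArrayList<NamedVector>();
while ((line = br.readLine()) != null) {
    String[] tokenized = line.split(args[2]);
    String id = tokenized[INDEX_ID];
    // every column after the id is a coordinate
    double[] coordinates = new double[tokenized.length - 1];
    for (int i = 1; i < tokenized.length; i++) {
        coordinates[i - 1] = Double.parseDouble(tokenized[i]);
    }
    vector.add(createNamedVector(coordinates, id));
}
br.close();
writeSequenceToPath(args[1], vector);

public static NamedVector createNamedVector(double[] points, String id) {
    return new NamedVector(new DenseVector(points), id);
}

public static void writeSequenceToPath(String directory,
        List<NamedVector> listOfVectors) throws IOException {
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);

    Path path = new Path(directory);
    // write a SequenceFile from the list of vectors;
    // key class is Text (the vector name), value class is VectorWritable
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path,
            Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : listOfVectors) {
        vec.set(v);
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();
}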


*Ramiro Manso*
/Data Analyst & BI consultant/

Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain

