mahout-user mailing list archives

From Ramiro <rma...@pragsis.com>
Subject Clustering points with their point id
Date Wed, 21 May 2014 07:50:08 GMT
Hello,

I have been trying to use Mahout k-means to cluster some synthetic 
data (A3-set <https://cs.joensuu.fi/sipu/datasets/>) as a testing 
reference, but it seems I'm missing something here. I'm quite the 
beginner, so bear with me, please. Before all the explanations, what I 
want to know is:

A) Is there a better way to cluster a series of numeric points and 
maintain their id value?
B) What am I doing wrong with this approach?

-Thanks for your patience



This is a sample of the data:

point1,53920,42968
point2,52019,42206
point3,52570,42476
point4,54220,42081
point5,54268,43420
point6,52288,42408
point7,54436,39727
point8,52391,44323
point9,54995,43655
point10,53761,43403

As you can see, it is in the format: id_point, value1, value2 (the comma 
doesn't matter; I could use tab separators instead).
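
For illustration, a minimal sketch of how one line in this format can be 
split into its id and its numeric values. The class name PointLine and 
the method parse are hypothetical names chosen for this example; the 
separator is passed as a regular expression because String.split 
expects one, so "," and "\t" both work.

```java
// Hypothetical helper, not part of Mahout: holds one parsed input line.
final class PointLine {
    final String id;
    final double[] coordinates;

    private PointLine(String id, double[] coordinates) {
        this.id = id;
        this.coordinates = coordinates;
    }

    // Splits "id,value1,value2,..." into the id and the numeric values.
    static PointLine parse(String line, String separatorRegex) {
        String[] tokens = line.split(separatorRegex);
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                    "Expected an id and at least one value: " + line);
        }
        double[] coordinates = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            // The first token is the id, so values start at index 1.
            coordinates[i - 1] = Double.parseDouble(tokens[i].trim());
        }
        return new PointLine(tokens[0].trim(), coordinates);
    }
}
```

Calling PointLine.parse("point1,53920,42968", ",") yields the id 
"point1" and the two values 53920.0 and 42968.0.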

All the test data I have seen that uses numeric data (not the Reuters 
example) has no id column of any sort to identify the points after the 
clusterdump, just the values. Is there a way to map the points to an ID 
using the command line?

On Stack Overflow 
<https://stackoverflow.com/questions/8785392/how-to-perform-k-means-clustering-in-mahout-with-vector-data-stored-as-csv>
I saw a code sample that writes its own sequence file, using NamedVectors 
to capture the ID. I created a similar version (code at the end), and it 
creates a SequenceFile in HDFS.

sample:

hadoop fs -text  /tmp/mahout-another/input/testSequence/ | head
point1  point1:{0:53920.0,1:42968.0}
point2  point2:{0:52019.0,1:42206.0}
point3  point3:{0:52570.0,1:42476.0}
point4  point4:{0:54220.0,1:42081.0}
point5  point5:{0:54268.0,1:43420.0}
point6  point6:{0:52288.0,1:42408.0}
point7  point7:{0:54436.0,1:39727.0}
point8  point8:{0:52391.0,1:44323.0}
point9  point9:{0:54995.0,1:43655.0}
point10 point10:{0:53761.0,1:43403.0}


Then I cluster with:

mahout kmeans -k 50 -i /tmp/mahout-another/input/testSequence/ \
    -o /tmp/mahout-another/output -c /tmp/mahout-initial-clusters --maxIter 10

After completion, running clusterdump with:

mahout clusterdump -i /tmp/mahout-another/output/clusters-10-final \
    -o output.txt -p /tmp/mahout-another/input/testSequence

throws:

Exception in thread "main" java.lang.ClassCastException: 
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

I'm guessing that the problem lies in the SequenceFile writer (code 
below), since it requires two class parameters and, following the 
Stack Overflow example, I write the name of the vector as the Text.class 
key and the vector itself as the value. The clusterdump then tries to 
read the points, finds Text where it expects IntWritable, and throws the 
exception. But if that's the case, how should I write the file instead?
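
To make the guess above concrete, here is a small sketch of the same 
kind of failure using plain Java stand-ins: String plays the role of 
Text and Integer the role of IntWritable. The class and method names are 
hypothetical, and this says nothing about where in clusterdump the real 
cast happens; it only shows that a reader which assumes one key type 
fails at runtime when the writer used another.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical demo, not Mahout or Hadoop code.
final class KeyTypeMismatchDemo {

    // Stand-in for a writer that stores the point name as the record key.
    static List<Map.Entry<Object, double[]>> writeWithNameKeys() {
        List<Map.Entry<Object, double[]>> records =
                new ArrayList<Map.Entry<Object, double[]>>();
        records.add(new SimpleEntry<Object, double[]>(
                "point1", new double[] {53920.0, 42968.0}));
        records.add(new SimpleEntry<Object, double[]>(
                "point2", new double[] {52019.0, 42206.0}));
        return records;
    }

    // Stand-in for a reader that assumes every key is an integer.
    static int sumKeys(List<Map.Entry<Object, double[]>> records) {
        int sum = 0;
        for (Map.Entry<Object, double[]> record : records) {
            // Throws ClassCastException when the key is a String.
            Integer key = (Integer) record.getKey();
            sum += key;
        }
        return sum;
    }
}
```

Passing the result of writeWithNameKeys() to sumKeys() fails with a 
ClassCastException on the first record, which is the same class of 
error as the one reported above.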




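Code for the sequence writer with NamedVectors:


imports:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;


main:

    // args[0]: input file, args[1]: output path, args[2]: separator (regex)
    BufferedReader br = new BufferedReader(new FileReader(args[0]));
    try {
        String line;
        List<NamedVector> vectors = new ArrayList<NamedVector>();
        while ((line = br.readLine()) != null) {
            String[] tokenized = line.split(args[2]);
            // INDEX_ID is a constant defined elsewhere in the class (not shown)
            String id = tokenized[INDEX_ID];
            // Every token after the id is a coordinate
            double[] coordinates = new double[tokenized.length - 1];
            for (int i = 1; i < tokenized.length; i++) {
                coordinates[i - 1] = Double.parseDouble(tokenized[i]);
            }
            vectors.add(createNamedVector(coordinates, id));
        }
        writeSequenceToPath(args[1], vectors);
    } finally {
        br.close();
    }


    public static NamedVector createNamedVector(double[] points, String id) {
        return new NamedVector(new DenseVector(points), id);
    }

    public static void writeSequenceToPath(String directory,
            List<NamedVector> listOfVectors) throws IOException {
        Configuration config = new Configuration();
        FileSystem fs = FileSystem.get(config);
        Path path = new Path(directory);

        // Write a SequenceFile from the list of vectors.
        // Key: vector name as Text; value: the vector as VectorWritable.
        SequenceFile.Writer writer = new SequenceFile.Writer(fs,
                config, path, Text.class, VectorWritable.class);
        try {
            VectorWritable vec = new VectorWritable();
            for (NamedVector v : listOfVectors) {
                vec.set(v);
                writer.append(new Text(v.getName()), vec);
            }
        } finally {
            writer.close();
        }
    }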




-- 
*Ramiro Manso*
/Data Analyst & BI consultant/

Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain

_http://www.bidoop.es_

