mahout-user mailing list archives

From "Videnova, Svetlana" <>
Subject clustering with kmeans, java app
Date Thu, 02 Aug 2012 08:57:53 GMT

I’m writing a Java application that clusters my data with k-means.

These are the steps:


Step 1 (LuceneDemo): creates an index and vectors using lucene.vector. Input: the path to my
.txt files. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm,
.prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file that Mahout will use) and
the vectors, which look like this:

€ðàó^æVG²RŸ˜Õ_________Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
_________Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
_________Ž__P(2):{
[… and others]

Could anyone please confirm that this output format looks right? If not, what should the
vectors generated by lucene.vector look like?
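
For reference, the readable part of each record in the dump above appears to be a sparse
vector written as {index:weight,index:weight,...}, and the bytes in front of it look like
binary file data rather than text. The sketch below is only an illustration of how that
text could be read back for inspection, assuming that layout holds. SparseVectorText and
parse are hypothetical names, not part of Mahout or Lucene.

```java
import java.util.TreeMap;

/** Hypothetical helper for inspecting a dumped vector; not a Mahout or Lucene class. */
class SparseVectorText {

    /** Parses text such as "{15:1.46,2:3.16}" into a map of index to weight, sorted by index. */
    static TreeMap<Integer, Double> parse(String text) {
        String body = text.trim();
        if (!body.startsWith("{") || !body.endsWith("}")) {
            throw new IllegalArgumentException("expected {index:weight,...} but got: " + text);
        }
        // Drop the surrounding braces.
        body = body.substring(1, body.length() - 1).trim();
        TreeMap<Integer, Double> weights = new TreeMap<>();
        if (body.isEmpty()) {
            return weights; // "{}" is a vector with no non-zero entries
        }
        for (String entry : body.split(",")) {
            String[] parts = entry.split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("bad entry: " + entry);
            }
            weights.put(Integer.parseInt(parts[0].trim()), Double.parseDouble(parts[1].trim()));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Three entries copied from the first record shown above.
        TreeMap<Integer, Double> v =
            parse("{15:1.4650986194610596,2:3.1613736152648926,0:0.9997141361236572}");
        System.out.println(v.size() + " non-zero entries, largest index " + v.lastKey());
    }
}
```

If every record parses this way, the text part of the output is at least internally
consistent. It does not by itself show that the file is in the format Mahout expects.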

This is part of the code:
    /* Creating vectors */
    Map vectorMap = new TreeMap();
    IndexReader reader =;
    int numDoc = reader.maxDoc();
    for (int i = 0; i < numDoc; i++) {

        TermFreqVector termFreqVector = reader.getTermFreqVector(i,



Step 2 (MainClass): creates clusters with Mahout. Input: the path to the vectors generated
in step 1 above. Output: the clusters. For the moment no clusters are created because of
this error:
Exception in thread "main" File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data
does not exist.
      at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(
      at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(
      at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(
      at org.apache.hadoop.mapred.JobClient.writeNewSplits(
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(
      at org.apache.hadoop.mapreduce.Job.submit(
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(
      at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(
      at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(
      at main.MainClass.main(

Could anyone please help me resolve this exception? I can’t understand why data could not
be created. I’m using the Hadoop and Mahout libraries on Windows, and I’m an administrator,
so it should not be a permissions problem.
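
One thing that may be worth checking, offered as an assumption rather than a confirmed
diagnosis: the trace fails inside SequenceFileInputFormat.listStatus while looking up
tf-vectors/wordcount/data, which suggests that a wordcount subfolder sits inside the folder
being read as input. Hadoop's SequenceFileInputFormat treats a directory entry in the input
folder as a MapFile and looks for a file named data inside it. The sketch below uses only
java.nio.file to list such subfolders before the job is submitted. InputDirCheck and
subdirectories are hypothetical names, and the path in main is the one from the error message.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical pre-flight check; not part of Hadoop or Mahout. */
class InputDirCheck {

    /** Returns the sorted names of the subdirectories directly inside inputDir. */
    static List<String> subdirectories(Path inputDir) {
        if (!Files.isDirectory(inputDir)) {
            throw new IllegalArgumentException("input directory does not exist: " + inputDir);
        }
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Path taken from the error message; adjust to the local setup.
        Path tfVectors = Paths.get("F:/MAHOUT/TesMahout/clusters/tf-vectors");
        System.out.println("Subfolders inside the input folder: " + subdirectories(tfVectors));
    }
}
```

If this reports a wordcount folder, one option to try is passing different folders as the
input and output arguments of calculateDF, so that job output is not written into the folder
that is also read as input.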

This is part of the code:

            Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                conf, chuckSize);

            TFIDFConverter.processTfIdf(
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
                sequentialAccessOutput, false, reduceTasks);

            Path vectorFolder = new Path("output");
            Path canopyCentroids = new Path(outputDir, "canopy-centroids");

            Path clusterOutput = new Path(outputDir, "clusters");

  , canopyCentroids, new EuclideanDistanceMeasure(),
250, 120, false,3,false);

  , vectorFolder, new Path(canopyCentroids,"clusters-0"), clusterOutput,
new TanimotoDistanceMeasure(), 0.01, 20, true,3, false);

Thank you for your time

