mahout-user mailing list archives

From Yuval Feinstein <yuv...@citypath.com>
Subject Re: clustering with kmeans, java app
Date Tue, 07 Aug 2012 06:15:38 GMT
I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Did you manage to run Hadoop on Windows? Do the Hadoop examples (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has
lots of details about this.
Some of the possible problems are Cygwin paths (which are not Linux paths),
HDFS/local-filesystem confusion, your Hadoop user (which may not match your
own user, permissions-wise), or other issues listed at the link above.
Good luck,
Yuval
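One quick, Hadoop-independent sanity check for the HDFS/local-filesystem confusion mentioned above is to verify, with plain JDK calls, that the directory a job is about to read actually exists on the local filesystem. This is only a sketch; the default path below is made up and should be replaced with whatever your job reads:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathCheck {
    /** Fails fast with a clear message if the expected job input is missing. */
    static void requireExists(Path p) {
        if (!Files.exists(p)) {
            throw new IllegalStateException(
                "Job input does not exist: " + p.toAbsolutePath());
        }
    }

    public static void main(String[] args) {
        // Hypothetical path -- substitute the directory your job actually reads.
        Path input = Paths.get(args.length > 0 ? args[0] : "clusters/tf-vectors");
        requireExists(input);
        System.out.println("OK: " + input);
    }
}
```

A check like this, run before submitting the job, distinguishes "the files were never written where I think they were" from a genuine Hadoop-on-Windows problem.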

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana
<svetlana.videnova@logica.com> wrote:
>
> Hello,
>
> I'm writing a Java app to cluster my data with k-means.
>
> Those are the steps:
>
> 1)
>
> LuceneDemo: creates the index and vectors using the lucene.vector library. Input: the path of my .txt file. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly for Mahout, .tvf) and vectors looking like this (a SequenceFile with org.apache.hadoop.io.Text keys and values; unprintable header bytes omitted):
>
> SEQ  org.apache.hadoop.io.Text  org.apache.hadoop.io.Text
> (0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (2):{ [… and others])
>
> Can anyone please confirm that this output format looks right? If not, what should the
vectors generated by lucene.vector look like?
>
> This is part of the code:
>
>     /* Creating vectors: one term-frequency vector per document */
>     Map vectorMap = new TreeMap();
>     IndexReader reader = IndexReader.open(index);
>     int numDoc = reader.maxDoc();
>     for (int i = 0; i < numDoc; i++) {
>         TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
>         addTermFreqToMap(vectorMap, termFreqVector);
>     }
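The addTermFreqToMap helper is not shown in the message. A minimal sketch of what such a helper plausibly does, assuming it accumulates term counts into the shared map (the array-based signature here is a stand-in for Lucene's TermFreqVector, which exposes getTerms() and getTermFrequencies()):

```java
import java.util.Map;

public class TermFreqMap {
    /**
     * Accumulate per-term counts into the shared map.
     * terms/freqs mirror TermFreqVector.getTerms() / getTermFrequencies().
     */
    static void addTermFreqToMap(Map<String, Integer> vectorMap,
                                 String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            Integer prev = vectorMap.get(terms[i]);
            vectorMap.put(terms[i], (prev == null ? 0 : prev) + freqs[i]);
        }
    }
}
```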
>
>
>
>
> 2)
>
> MainClass: creates the clusters with Mahout. Input: the path of the vectors generated in step 1 (see above). Output: the clusters (at the moment none are created because of this error:
> Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
>       at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>       at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>       at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>       at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>       at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>       at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
>       at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
>       at main.MainClass.main(MainClass.java:144))
>
>
> Can anyone please help me solve this exception? I can't understand why the data
could not be created, given that I'm using the Hadoop and Mahout libraries on Windows (and
since I'm an admin, it should not be a permissions problem).
>
>
> This is part of the code :
>
>
>     Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
>             new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>             new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>             conf, chunkSize);
>
>     TFIDFConverter.processTfIdf(
>             new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>             new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
>             sequentialAccessOutput, false, reduceTasks);
>
>     Path vectorFolder = new Path("output");
>     Path canopyCentroids = new Path(outputDir, "canopy-centroids");
>     Path clusterOutput = new Path(outputDir, "clusters");
>
>     CanopyDriver.run(vectorFolder, canopyCentroids, new EuclideanDistanceMeasure(),
>             250, 120, false, 3, false);
>
>     KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
>             clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
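For reference, the weight that TF-IDF vectorization assigns to a term can be sketched in plain Java with the simplified formula tfidf = tf * log(N / df). This is only an illustration of the idea; Mahout's TFIDFConverter applies its own normalization and pruning on top, so its numbers will differ:

```java
public class TfIdf {
    /**
     * Simplified TF-IDF weight: term frequency scaled by inverse
     * document frequency. numDocs = N, docFreq = number of documents
     * containing the term.
     */
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }
}
```

The intuition: a term that appears in every document gets weight zero, while a rare term that appears often in one document gets a large weight.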
>
>
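Likewise, the core step that KMeansDriver iterates, assigning each point to its nearest centroid, can be sketched in plain Java (2-D points and Euclidean distance here; the real driver operates on Mahout Vectors, runs as MapReduce jobs, and supports other distance measures such as Tanimoto):

```java
public class KMeansStep {
    /** Index of the centroid nearest to point p (Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = p[0] - centroids[c][0];
            double dy = p[1] - centroids[c][1];
            double d = dx * dx + dy * dy; // squared distance preserves ordering
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

Each k-means iteration assigns all points this way, then recomputes every centroid as the mean of its assigned points, until assignments stop changing or the convergence delta (0.01 in the call above) is reached.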
> Thank you for your time
>
>
>
>
> Regards
>
>
