mahout-user mailing list archives

From "Zehao Jin" <zehao...@gmail.com>
Subject A Mahout Naive Bayes classifier problem
Date Fri, 04 May 2012 13:08:32 GMT
Dear all,
I'm a Mahout beginner, and I need to use the Mahout Naive Bayes classifier for text classification.
To get started, I followed the Twenty Newsgroups example:
1. Start the Hadoop cluster.
2. Run the 20 newsgroups example by executing the script $ ./examples/bin/build-20news-bayes.sh
and choose the Naive Bayes method.
3. Finally, I got the same confusion matrix as the one shown here: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
However, I have to classify Chinese texts and had no clue how to start, so I read the shell script
examples/bin/build-20news-bayes.sh to learn how the example works. Then I followed the same steps as the script:
1. Prepare the training data.
The script uses org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups to format the e-mail
texts into one document per line, consisting of the label followed by the words. Chinese is different
from English: words are not separated by spaces, and different character combinations have different
meanings. So I used a Chinese text analyzer to split the words and then matched the same format. Each
line looks like this: Label+'\t'+word1 word2 ....+'\n';
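
The line format described above can be sketched roughly as follows. This is only a minimal illustration, not the analyzer actually used: the names format_line, write_corpus and char_tokenize are hypothetical, and char_tokenize is a stand-in that emits one token per character, where a real Chinese word segmenter would be plugged in instead. Writing the file as UTF-8 is also an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per non-whitespace character.

    A real Chinese analyzer would return multi-character words instead.
    """
    return [ch for ch in text if not ch.isspace()]


def format_line(
    label: str,
    text: str,
    tokenize: Callable[[str], List[str]] = char_tokenize,
) -> str:
    """Build one training line: label, a tab, space-separated tokens, newline."""
    # Remove any whitespace inside the label or the tokens, since a stray
    # tab or newline would break the one-document-per-line layout.
    clean_label = "".join(label.split())
    tokens = ["".join(tok.split()) for tok in tokenize(text)]
    tokens = [tok for tok in tokens if tok]
    return clean_label + "\t" + " ".join(tokens) + "\n"


def write_corpus(docs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (label, text) pairs to a file, one formatted document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for label, text in docs:
            out.write(format_line(label, text))
```

With the stand-in tokenizer, format_line("sports", "足球 比赛") yields the label, a tab, and the four characters separated by single spaces.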
The example's analyzer output:

And the Chinese analyzer output:


2. Put the formatted training data and test data into HDFS. (My Hadoop platform has 1 namenode
and 4 datanodes on Fedora 14.) The example has 20 categories, and my corpus has 10 categories:
The example:              My categories:

3. Train the classifier and test it on Hadoop.
The example does it like this:
  ./bin/mahout trainclassifier -i /20news-bydate/bayes-train-input -o /20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
  ./bin/mahout testclassifier -m /20news-bydate/bayes-model -d /20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
My commands are exactly the same as in the example; the only difference is the directories.

Strangely, I cannot get a result as in the example. I ran the program several times, but the
MapReduce job always fails with:
Task xxx failed to report status for 600 seconds. Killing.

What I want to ask is: are the Mahout trainclassifier (./bin/mahout trainclassifier xxx) and
testclassifier (./bin/mahout testclassifier xxx) programs suitable for my data, or can they only
be used with the 20 newsgroups example? If they cannot be used, it will be really hard for me to
implement the Naive Bayes algorithm myself. Or is it a charset problem? Many problems are caused by
that. Can you give me some support? I have been scratching my head for a few days. Thank you very much!
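
One way to narrow down the charset question would be to check the training file before uploading it to HDFS. The sketch below is only an illustration under stated assumptions: it assumes the file is meant to be UTF-8 with exactly the label, tab, words layout described earlier, and the function name check_training_lines is hypothetical. It does not explain the 600-second timeout by itself; it only rules out malformed input lines.

```python
from typing import List


def check_training_lines(raw: bytes) -> List[str]:
    """Return a list of problems found in a label<TAB>words training file."""
    problems: List[str] = []
    # Splitting the raw bytes on b"\n" is safe for UTF-8, because the
    # newline byte never occurs inside a multi-byte sequence.
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if not line:
            continue  # skip blank lines, including the one after the final newline
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"line {number}: not valid UTF-8")
            continue
        label, tab, words = text.partition("\t")
        if not tab:
            problems.append(f"line {number}: no tab between label and words")
        elif not label.strip() or not words.strip():
            problems.append(f"line {number}: empty label or empty word list")
    return problems
```

A file could be checked with check_training_lines(open(path, "rb").read()), where path is the local training file; an empty list means no problem of these kinds was found.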



Zehao Jin, SCUT, China.