mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: A Mahout Naive Bayes classifier problem
Date Sat, 05 May 2012 00:23:12 GMT
Yes, it could be a charset problem. It could also be the total
number of unique terms you are feeding the trainer.
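
If it is the charset, the usual culprit is reading files with the
platform default encoding. A minimal sketch of forcing UTF-8 (the class
name and file handling here are made up for illustration):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class ReadUtf8 {
      public static void main(String[] args) throws Exception {
        // Read Chinese text with an explicit charset, not the platform default.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
          System.out.println(line);
        }
        in.close();
      }
    }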

Which analyzer do you use? Is it the Lucene "CJKAnalyzer"? That one
creates overlapping bigrams of successive characters, so the number of
unique terms explodes, and that in turn blows up the Hadoop job. The
"SmartChineseAnalyzer" uses a trained model to segment the text into
words of one, two, or three characters. The "StandardAnalyzer" splits
CJK text into single-character terms. Given that this is a naive Bayes
model, which assumes the terms are independent anyway, single-character
terms should be good enough. I would go with the StandardAnalyzer.
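
A minimal sketch of the difference, assuming Lucene 3.x is on the
classpath (lucene-core plus the analyzers contrib for CJKAnalyzer); the
sample sentence is made up:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzerDemo {
      // Print every token the analyzer emits for the given text.
      static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.print("[" + term.toString() + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
      }

      public static void main(String[] args) throws Exception {
        String text = "我爱自然语言处理";  // made-up sample sentence
        // CJKAnalyzer: overlapping character bigrams -> [我爱] [爱自] [自然] ...
        dump(new CJKAnalyzer(Version.LUCENE_35), text);
        // StandardAnalyzer: one term per CJK character -> [我] [爱] [自] ...
        dump(new StandardAnalyzer(Version.LUCENE_35), text);
      }
    }

The bigram vocabulary is bounded by the square of the character
vocabulary, which is why the CJKAnalyzer route explodes first.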

(I learned all of this just now in my day job in the Lucene business.)

On Fri, May 4, 2012 at 6:32 AM, Robin Anil <robin.anil@gmail.com> wrote:
> Can you provide the console output from when you run train or test?
> On May 4, 2012 8:09 AM, "Zehao Jin" <zehaojin@gmail.com> wrote:
>
>> Dear all,
>> I'm a Mahout beginner, and I need to use the Mahout Naive Bayes
>> classifier for text classification. To get started, I followed the
>> Twenty Newsgroups example:
>> 1. Start the Hadoop cluster.
>> 2. Run the 20 newsgroups example by executing the script
>> $ ./examples/bin/build-20news-bayes.sh and choosing the Naive Bayes method.
>> 3. In the end I got the same confusion matrix as the one posted here:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
>> But I have to classify Chinese texts. I had no clue at first, so I read
>> the shell script examples/bin/build-20news-bayes.sh and learned how the
>> example is processed. Then I followed the same steps:
>> 1. Prepare the training data.
>> The script uses org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups
>> to format the e-mail texts into one document per line: the label followed
>> by the words. As you know, Chinese is different from English: words are
>> not separated by spaces, and different character combinations have
>> different meanings. So I used a Chinese text analyzer to split the words
>> and match the required format. Each line looks like this:
>> Label + '\t' + word1 word2 ... + '\n'
>> [screenshots comparing the example's analyzer output with the Chinese
>> analyzer's output are not preserved in the archive]
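>>
>> A rough sketch of this formatting step, assuming Lucene's
>> SmartChineseAnalyzer (the lucene-smartcn contrib) is on the classpath;
>> the class name is made up:
>>
>>     import java.io.StringReader;
>>
>>     import org.apache.lucene.analysis.TokenStream;
>>     import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
>>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>     import org.apache.lucene.util.Version;
>>
>>     public class FormatDoc {
>>       // Emit one training line: Label + '\t' + word1 word2 ... + '\n'
>>       public static String format(String label, String text) throws Exception {
>>         SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_35);
>>         TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
>>         CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>>         StringBuilder words = new StringBuilder();
>>         ts.reset();
>>         while (ts.incrementToken()) {
>>           if (words.length() > 0) words.append(' ');
>>           words.append(term.toString());
>>         }
>>         ts.end();
>>         ts.close();
>>         return label + "\t" + words + "\n";
>>       }
>>     }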
>>
>> 2. Put the formatted training data and test data into HDFS. (My Hadoop
>> platform has 1 namenode and 4 datanodes on Fedora 14.) The example has 20
>> categories, and my corpus has 10 categories:
>> [side-by-side listing of the example's categories and my categories not
>> preserved in the archive]
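>>
>> For reference, the upload can be as simple as the following (the local
>> directory names are made up):
>>
>>     hadoop fs -put 20news-train /20news-bydate/bayes-train-input
>>     hadoop fs -put 20news-test /20news-bydate/bayes-test-input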
>>
>> 3. Train the classifier and test it on Hadoop.
>> The example does it like this:
>>
>> ./bin/mahout trainclassifier -i /20news-bydate/bayes-train-input -o /20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
>>
>> ./bin/mahout testclassifier -m /20news-bydate/bayes-model -d /20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
>> My commands follow the example exactly; the only difference is the
>> directories.
>>
>> Strangely, I cannot get the same result as the example. I ran the
>> program several times, but the MapReduce job always fails with:
>> Task xxx failed to report status for 600 seconds. Killing.
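>>
>> That 600-second limit is Hadoop's default task timeout
>> (mapred.task.timeout, in milliseconds, default 600000). One workaround
>> while debugging is to raise it in mapred-site.xml; the 30-minute value
>> below is only an example:
>>
>>     <property>
>>       <name>mapred.task.timeout</name>
>>       <value>1800000</value> <!-- 30 minutes; default is 600000 -->
>>     </property>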
>>
>> What I want to ask is: are the Mahout trainclassifier
>> (./bin/mahout trainclassifier xxx) and testclassifier
>> (./bin/mahout testclassifier xxx) commands suitable for my program, or
>> can they only be used with the 20 newsgroups example? If they cannot be
>> used, it will be really hard for me to implement the Naive Bayes
>> algorithm... Or is it a charset problem? Many problems are caused by
>> that. Can you give me some support? I have been scratching my head over
>> this for a few days. Thank you very much!
>> ------------------------------
>>  Zehao Jin, SCUT, China.
>>



-- 
Lance Norskog
goksron@gmail.com
