mahout-user mailing list archives

From Benjamin Rey <benja...@c-optimal.com>
Subject Re: 92% accuracy on Weka NaiveBayesMultinomial vs 66% with Mahout bayes
Date Sat, 17 Sep 2011 08:30:44 GMT
Thanks a lot for your interest and time.
I'm computer-less for the coming week, but I'll run a few more
experiments and post the data as soon as I'm back home.

Thanks.

Benjamin


Le 17 sept. 2011 à 00:24, Ted Dunning <ted.dunning@gmail.com> a écrit :

> Benjamin,
>
> Can you post your actual training data on dropbox or some other place so
> that we can replicate the problem?
>
> On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey <benjamin.rey@c-optimal.com>wrote:
>
>> Unfortunately CNB gives me the same 66% accuracy.
>>
>> I paste the Mahout and Weka commands below.
>>
>> I also tried removing the biggest class; that helps, but then the 2nd
>> biggest class is overwhelmingly predicted. Mahout bayes seems to favor
>> the biggest class much more strongly than its prior would justify,
>> contrary to Weka's implementation. Is there any choice of parameters,
>> or in the way the weights are computed, that could be causing this?
>>
>> thanks.
>>
>> benjamin
>>
>> here are the commands:
>> On Mahout:
>> # training set: the usual prepare20newsgroups step, followed by
>> # subsampling to keep just a few classes
>> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p examples/bin/work/20news-bydate/20news-bydate-train -o examples/bin/work/20news-bydate/bayes-train-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> mkdir examples/bin/work/20news_ss/bayes-train-input/
>> head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
>> head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
>> head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
>> hdput examples/bin/work/20news_ss/bayes-train-input examples/bin/work/20news_ss/bayes-train-input
>>
>> then the exact same thing for the test set
>>
>> # actual training:
>> bin/mahout trainclassifier -i examples/bin/work/20news_ss/bayes-train-input -o examples/bin/work/20news-bydate/cbayes-model_ss -type cbayes -ng 1 -source hdfs
>>
>> # testing
>> bin/mahout testclassifier -d examples/bin/work/20news_ss/bayes-test-input -m examples/bin/work/20news-bydate/cbayes-model_ss -type cbayes -ng 1 -source hdfs
>>
>> => 66% accuracy
>>
>> and for weka
>> # create the .arff file from 20news_ss train and test:
>> start the file with appropriate header:
>> -----
>> @relation _home_benjamin_Data_BY_weka
>>
>> @attribute text string
>> @attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>>
>> @data
>> -----
>> # then append the data:
>> cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* | perl mh2arff.pl >> 20news_ss_test.arff
>> # where mh2arff.pl is:
>> ----
>> use strict;
>> use warnings;
>> while (<STDIN>) {
>>   chomp;
>>   s/\'/\\\'/g;                    # escape single quotes for ARFF strings
>>   s/ $//;                         # drop a trailing space
>>   my ($c, $t) = split("\t", $_);  # input lines are "class<TAB>text"
>>   print "'$t',$c\n";              # ARFF row: quoted text, then class
>> }
>> ---
>> # and the train/test command:
>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T 20news_ss_test.arff -F "weka.filters.unsupervised.attribute.StringToWordVector -S" -W weka.classifiers.bayes.NaiveBayesMultinomial
>>
>> => 92% accuracy
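As an aside, the prior-dominance effect under discussion can be sketched numerically. This is a toy illustration only, not Mahout's or Weka's actual code; the class names, counts, and vocabulary size below are all invented:

```python
import math

# Invented training statistics: a 400-doc class vs. a 30-doc class,
# and one word w that is relatively MORE frequent in the small class.
docs = {"big": 400, "small": 30}           # documents per class
word_w = {"big": 10, "small": 15}          # occurrences of w per class
total_words = {"big": 5000, "small": 400}  # total word tokens per class
vocab = 1000                               # assumed vocabulary size

def log_score(c):
    """Multinomial NB score for a one-word document containing w."""
    prior = math.log(docs[c] / sum(docs.values()))
    # Laplace-smoothed log-likelihood of word w under class c
    likelihood = math.log((word_w[c] + 1) / (total_words[c] + vocab))
    return prior + likelihood

# The big class wins even though w is more characteristic of the
# small class: the log-prior gap outweighs the likelihood gap.
print(max(docs, key=log_score))  # → big
```

With equal priors the small class would win here, which is one reason reweighting schemes such as CNB were proposed for skewed data.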
>>
>>
>>
>>
>>
>>
>> 2011/9/16 Robin Anil <robin.anil@gmail.com>
>>
>>> Did you try complement naive Bayes (CNB)? I am guessing the multinomial
>>> naive Bayes mentioned here is a CNB-like implementation and not plain NB.
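For readers unfamiliar with CNB: it scores a class by how poorly the document matches everything *outside* that class. A minimal sketch of that idea (after Rennie et al. 2003; the two-word corpus and counts here are invented, and Mahout's cbayes is considerably more elaborate):

```python
import math
from collections import Counter

# Invented per-class word counts: "engine" is typical of class "big",
# "goal" of class "small".
counts = {
    "big":   Counter({"engine": 50, "goal": 5}),
    "small": Counter({"engine": 2,  "goal": 8}),
}
vocab_size = 2  # engine, goal

def cnb_weight(c, w):
    """Smoothed log-probability of word w in the COMPLEMENT of class c."""
    comp_w = sum(counts[o][w] for o in counts if o != c)
    comp_total = sum(sum(counts[o].values()) for o in counts if o != c)
    return math.log((comp_w + 1) / (comp_total + vocab_size))

def classify(doc_words):
    # Choose the class whose complement explains the document WORST.
    return min(counts, key=lambda c: sum(cnb_weight(c, w) for w in doc_words))

print(classify(["goal", "goal"]))  # → small
print(classify(["engine"]))        # → big
```

Because each class's weights are estimated from the pooled complement data, CNB's estimates are less distorted by a single dominant class than plain NB's per-class estimates.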
>>>
>>>
>>> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <benjamin.rey@c-optimal.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying different classifiers on a classic text classification
>>>> problem, very close to the 20newsgroups one.
>>>> I end up with much better results with Weka NaiveBayesMultinomial than
>>>> with Mahout bayes.
>>>> The main problem comes from my data being unbalanced. I know bayes
>>>> struggles with that, yet I'm surprised by the size of the gap between
>>>> Weka and Mahout.
>>>>
>>>> I went back to the 20newsgroups example, picked only 5 classes, and
>>>> subsampled them to get classes with 400, 200, 100, 100 and 30 examples,
>>>> and pretty much the same for the test set.
>>>> On Mahout with bayes 1-gram, I'm getting 66% correctly classified (see
>>>> the confusion matrix below).
>>>> On Weka, on the exact same data, without any tuning, I'm getting 92%
>>>> correctly classified.
>>>>
>>>> Would anyone know where the difference comes from, and whether there
>>>> are ways I could tune Mahout to get better results? My data is small
>>>> enough for Weka for now, but that won't last.
>>>>
>>>> Many thanks
>>>>
>>>> Benjamin.
>>>>
>>>>
>>>>
>>>> MAHOUT:
>>>> -------------------------------------------------------
>>>> Correctly Classified Instances          :        491       65.5541%
>>>> Incorrectly Classified Instances        :        258       34.4459%
>>>> Total Classified Instances              :        749
>>>>
>>>> =======================================================
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>> a        b        c        d        e        f        <--Classified as
>>>> 14       82       0        4        0        0         |  100       a = rec.sport.hockey
>>>> 0        319      0        0        0        0         |  319       b = alt.atheism
>>>> 0        88       3        9        0        0         |  100       c = rec.autos
>>>> 0        45       0        155      0        0         |  200       d = comp.graphics
>>>> 0        25       0        5        0        0         |  30        e = sci.med
>>>> 0        0        0        0        0        0         |  0         f = unknown
>>>> Default Category: unknown: 5
>>>>
>>>>
>>>> WEKA:
>>>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T 20news_ss_test.arff -F "weka.filters.unsupervised.attribute.StringToWordVector -S" -W weka.classifiers.bayes.NaiveBayesMultinomial
>>>>
>>>> === Error on test data ===
>>>>
>>>> Correctly Classified Instances         688               91.8558 %
>>>> Incorrectly Classified Instances        61                8.1442 %
>>>> Kappa statistic                          0.8836
>>>> Mean absolute error                      0.0334
>>>> Root mean squared error                  0.1706
>>>> Relative absolute error                 11.9863 %
>>>> Root relative squared error             45.151  %
>>>> Total Number of Instances              749
>>>>
>>>>
>>>> === Confusion Matrix ===
>>>>
>>>>  a   b   c   d   e   <-- classified as
>>>> 308   9   2   0   0 |   a = alt.atheism
>>>>  5 195   0   0   0 |   b = comp.graphics
>>>>  3  11  84   2   0 |   c = rec.autos
>>>>  3   3   0  94   0 |   d = rec.sport.hockey
>>>>  6  11   6   0   7 |   e = sci.med
>>>>
>>>
>>
