Thanks a lot for your interest and time.
I'm computer-less for the coming week, but I'll run a few more
experiments and post the data as soon as I'm back home.
Thanks.
Benjamin
On 17 Sept 2011, at 00:24, Ted Dunning wrote:
> Benjamin,
>
> Can you post your actual training data on dropbox or some other place so
> that we can replicate the problem?
>
> On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey wrote:
>
>> Unfortunately CNB gives me the same 66% accuracy.
>>
>> I paste the commands for Mahout and Weka below.
>>
>> I also tried removing the biggest class; it helps, but then the 2nd
>> biggest class is overwhelmingly predicted. Mahout Bayes seems to favor
>> the biggest class heavily (more than the prior would justify), contrary
>> to Weka's implementation. Is there any choice in the parameters, or in
>> the way weights are computed, that could be causing this?
>>
>> thanks.
>>
>> benjamin
>>
>> here are the commands:
>> On Mahout:
>> # training set: the usual prepare20newsgroups, followed by subsampling
>> # to keep just a few classes
>> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
>>   -p examples/bin/work/20news-bydate/20news-bydate-train \
>>   -o examples/bin/work/20news-bydate/bayes-train-input \
>>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> mkdir examples/bin/work/20news_ss/bayes-train-input/
>> head -400 examples/bin/work/20news-bydate/bayes-train-input/alt.atheism.txt \
>>   > examples/bin/work/20news_ss/bayes-train-input/alt.atheism.txt
>> head -200 examples/bin/work/20news-bydate/bayes-train-input/comp.graphics.txt \
>>   > examples/bin/work/20news_ss/bayes-train-input/comp.graphics.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.autos.txt \
>>   > examples/bin/work/20news_ss/bayes-train-input/rec.autos.txt
>> head -100 examples/bin/work/20news-bydate/bayes-train-input/rec.sport.hockey.txt \
>>   > examples/bin/work/20news_ss/bayes-train-input/rec.sport.hockey.txt
>> head -30 examples/bin/work/20news-bydate/bayes-train-input/sci.med.txt \
>>   > examples/bin/work/20news_ss/bayes-train-input/sci.med.txt
>> hdput examples/bin/work/20news_ss/bayes-train-input \
>>   examples/bin/work/20news_ss/bayes-train-input
>>
>> then exactly the same thing for the test set
>>
>> # actual training:
>> bin/mahout trainclassifier -i examples/bin/work/20news_ss/bayes-train-input \
>>   -o examples/bin/work/20news-bydate/cbayes-model_ss -type cbayes -ng 1 \
>>   -source hdfs
>>
>> # testing
>> bin/mahout testclassifier -d examples/bin/work/20news_ss/bayes-test-input \
>>   -m examples/bin/work/20news-bydate/cbayes-model_ss -type cbayes -ng 1 \
>>   -source hdfs
>>
>> => 66% accuracy
>>
>> and for weka
>> # create the .arff file from 20news_ss train and test:
>> start the file with appropriate header:
>> -----
>> @relation _home_benjamin_Data_BY_weka
>>
>> @attribute text string
>> @attribute class
>> {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>>
>> @data
>> -----
>> # then append the data:
>> cat ~/workspace/mahout-0.5/examples/bin/work/20news_ss/bayes-test-input/* |
>> perl mh2arff.pl >> 20news_ss_test.arff
>> # with mh2arff.pl:
>> ----
>> use strict;
>> while (<>) {            # read label<TAB>text lines from stdin
>>     chomp;
>>     $_ =~ s/\'/\\\'/g;  # escape single quotes for the ARFF string attribute
>>     $_ =~ s/ $//;
>>     my ($c, $t) = split("\t", $_);
>>     print "'$t',$c\n";
>> }
>> ---
>> # and the train/test command:
>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>>   -T 20news_ss_test.arff \
>>   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>>   -W weka.classifiers.bayes.NaiveBayesMultinomial
>>
>> => 92% accuracy
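
[For anyone replicating: the header-plus-filter step above can be sketched in a few lines of Python. This is illustrative only; the sample row below is made up, and the thread itself uses the Perl filter shown earlier.]

```python
# Build a minimal ARFF like the one in the thread. Input lines are
# "label<TAB>text", as produced by PrepareTwentyNewsgroups.
HEADER = """@relation 20news_ss

@attribute text string
@attribute class {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}

@data"""

def to_arff_row(line):
    # mirror the Perl filter: escape single quotes, emit 'text',label
    label, text = line.rstrip("\n").split("\t", 1)
    text = text.rstrip(" ").replace("'", "\\'")
    return "'%s',%s" % (text, label)

lines = ["sci.med\tthe doctor's advice "]  # hypothetical sample row
print(HEADER)
for line in lines:
    print(to_arff_row(line))  # -> 'the doctor\'s advice',sci.med
```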
>>
>>
>>
>>
>>
>>
>> 2011/9/16 Robin Anil
>>
>>> Did you try complement naive Bayes (CNB)? I am guessing the multinomial
>>> naive Bayes mentioned here is a CNB-like implementation and not plain NB.
>>>
>>>
>>> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <benjamin.rey@c-optimal.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying out different classifiers on a classical text-classification
>>>> problem very close to the 20newsgroups one.
>>>> I end up with much better results with Weka's NaiveBayesMultinomial than
>>>> with Mahout Bayes.
>>>> The main problem comes from the fact that my data is unbalanced. I know
>>>> Bayes has difficulties with that, yet I'm surprised by the difference
>>>> between Weka and Mahout.
>>>>
>>>> I went back to the 20newsgroups example, picked 5 classes only, and
>>>> subsampled those to get classes with 400, 200, 100, 100 and 30 examples,
>>>> and pretty much the same for the test set.
>>>> On Mahout with Bayes 1-gram, I'm getting 66% correctly classified (see
>>>> below for the confusion matrix).
>>>> On Weka, on the same exact data, without any tuning, I'm getting 92%
>>>> correctly classified.
>>>>
>>>> Would anyone know where the difference comes from, and whether there are
>>>> ways I could tune Mahout to get better results? My data is small enough
>>>> for Weka for now, but this won't last.
>>>>
>>>> Many thanks
>>>>
>>>> Benjamin.
>>>>
>>>>
>>>>
>>>> MAHOUT:
>>>> -------------------------------------------------------
>>>> Correctly Classified Instances : 491 65.5541%
>>>> Incorrectly Classified Instances : 258 34.4459%
>>>> Total Classified Instances : 749
>>>>
>>>> =======================================================
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>> a    b    c    d    e    f    <-- Classified as
>>>> 14   82   0    4    0    0   | 100  a = rec.sport.hockey
>>>> 0    319  0    0    0    0   | 319  b = alt.atheism
>>>> 0    88   3    9    0    0   | 100  c = rec.autos
>>>> 0    45   0    155  0    0   | 200  d = comp.graphics
>>>> 0    25   0    5    0    0   | 30   e = sci.med
>>>> 0    0    0    0    0    0   | 0    f = unknown
>>>> Default Category: unknown: 5
>>>>
>>>>
>>>> WEKA:
>>>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff \
>>>>   -T 20news_ss_test.arff \
>>>>   -F "weka.filters.unsupervised.attribute.StringToWordVector -S" \
>>>>   -W weka.classifiers.bayes.NaiveBayesMultinomial
>>>>
>>>> === Error on test data ===
>>>>
>>>> Correctly Classified Instances 688 91.8558 %
>>>> Incorrectly Classified Instances 61 8.1442 %
>>>> Kappa statistic 0.8836
>>>> Mean absolute error 0.0334
>>>> Root mean squared error 0.1706
>>>> Relative absolute error 11.9863 %
>>>> Root relative squared error 45.151 %
>>>> Total Number of Instances 749
>>>>
>>>>
>>>> === Confusion Matrix ===
>>>>
>>>> a b c d e <-- classified as
>>>> 308 9 2 0 0 | a = alt.atheism
>>>> 5 195 0 0 0 | b = comp.graphics
>>>> 3 11 84 2 0 | c = rec.autos
>>>> 3 3 0 94 0 | d = rec.sport.hockey
>>>> 6 11 6 0 7 | e = sci.med
>>>>
>>>
>>
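
[For reference, the majority-class pull described in the thread can be reproduced with a tiny from-scratch sketch. This is toy code, not the Mahout or Weka implementation: with heavy class imbalance and sparse evidence, plain multinomial NB falls back on the prior, while the complement formulation (per Rennie et al. 2003, which CNB follows) scores each class against all the *other* classes' counts, so the majority class no longer dominates.]

```python
import math
from collections import Counter

def train_counts(docs):
    # docs: list of (label, tokens); returns per-class word counts
    by_class = {}
    for label, tokens in docs:
        by_class.setdefault(label, Counter()).update(tokens)
    return by_class

def mnb_predict(by_class, priors, tokens, vocab, alpha=1.0):
    # standard multinomial NB: argmax_c log p(c) + sum_w f_w log theta_{c,w}
    best, best_score = None, -math.inf
    for c, counts in by_class.items():
        total = sum(counts.values())
        score = math.log(priors[c])
        for w in tokens:
            score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

def cnb_predict(by_class, tokens, vocab, alpha=1.0):
    # complement NB: estimate theta from every class EXCEPT c, then take
    # argmax_c of -sum_w f_w log theta_{~c,w}
    best, best_score = None, -math.inf
    for c in by_class:
        comp = Counter()
        for other, counts in by_class.items():
            if other != c:
                comp.update(counts)
        total = sum(comp.values())
        score = 0.0
        for w in tokens:
            score -= math.log((comp[w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Imbalanced toy corpus: 10 "sports" docs vs a single "tech" doc.
docs = [("sports", ["ball", "goal"])] * 10 + [("tech", ["code", "ball"])]
vocab = {"ball", "goal", "code"}
by_class = train_counts(docs)
priors = {"sports": 10 / 11, "tech": 1 / 11}

# The word "code" appears only in the "tech" class, yet plain MNB still
# picks the majority class because the prior outweighs the sparse evidence.
print(mnb_predict(by_class, priors, ["code"], vocab))  # -> sports
print(cnb_predict(by_class, ["code"], vocab))          # -> tech
```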