Thanks a lot for your interest and time.
I'm computerless for the coming week, but I'll run a few more
experiments and post the data as soon as I'm back home.
Thanks.
Benjamin
On 17 Sep 2011, at 00:24, Ted Dunning <ted.dunning@gmail.com> wrote:
> Benjamin,
>
> Can you post your actual training data on dropbox or some other place so
> that we can replicate the problem?
>
> On Fri, Sep 16, 2011 at 3:38 PM, Benjamin Rey <benjamin.rey@coptimal.com> wrote:
>
>> Unfortunately CNB gives me the same 66% accuracy.
>>
>> I've pasted the commands for Mahout and Weka below.
>>
>> I also tried removing the biggest class; that helps, but then the 2nd
>> biggest class is overwhelmingly predicted. Mahout's Bayes seems to favor
>> the biggest class heavily (more than the prior alone would explain),
>> contrary to Weka's implementation. Is there any choice of parameters, or
>> in the way the weights are computed, that could be causing this?
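A minimal sketch of the effect described above (invented toy counts and a hand-rolled multinomial NB with Laplace smoothing; not Mahout's actual code): with a skewed class prior, a term that is relatively much more frequent in a small class can still lose to the big class.

```python
import math

# Toy corpus with a 40:3 class imbalance, loosely mimicking the
# 400/.../30 subsampling discussed in this thread. Invented data.
docs = {
    "big":   [["ball", "game"]] * 30 + [["ball", "game", "doctor"]] * 10,
    "small": [["doctor", "medicine"]] * 3,
}
vocab = ["ball", "game", "doctor", "medicine"]

def train(docs):
    total_docs = sum(len(ds) for ds in docs.values())
    model = {}
    for c, ds in docs.items():
        counts = {w: 0 for w in vocab}
        for d in ds:
            for w in d:
                counts[w] += 1
        n = sum(counts.values())
        model[c] = {
            "logprior": math.log(len(ds) / total_docs),
            # Laplace-smoothed per-term log likelihoods
            "loglik": {w: math.log((counts[w] + 1.0) / (n + len(vocab)))
                       for w in vocab},
        }
    return model

def classify(model, doc):
    scores = {c: m["logprior"] + sum(m["loglik"][w] for w in doc)
              for c, m in model.items()}
    return max(scores, key=scores.get)

model = train(docs)
print(classify(model, ["doctor"]))              # big: the prior outvotes the term
print(classify(model, ["doctor", "medicine"]))  # small: evidence now wins
```

With only "doctor" as evidence, the log prior gap (log 40/3) exceeds the log likelihood gap, so the majority class wins; adding "medicine" tips it back.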
>>
>> thanks.
>>
>> benjamin
>>
>> here are the commands:
>> On Mahout:
>> # training set: the usual prepare-20newsgroups step, followed by
>> # subsampling for just a few classes
>> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p
>> examples/bin/work/20newsbydate/20newsbydatetrain -o
>> examples/bin/work/20newsbydate/bayestraininput -a
>> org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> mkdir examples/bin/work/20news_ss/bayestraininput/
>> head -400 examples/bin/work/20newsbydate/bayestraininput/alt.atheism.txt >
>> examples/bin/work/20news_ss/bayestraininput/alt.atheism.txt
>> head -200 examples/bin/work/20newsbydate/bayestraininput/comp.graphics.txt >
>> examples/bin/work/20news_ss/bayestraininput/comp.graphics.txt
>> head -100 examples/bin/work/20newsbydate/bayestraininput/rec.autos.txt >
>> examples/bin/work/20news_ss/bayestraininput/rec.autos.txt
>> head -100 examples/bin/work/20newsbydate/bayestraininput/rec.sport.hockey.txt >
>> examples/bin/work/20news_ss/bayestraininput/rec.sport.hockey.txt
>> head -30 examples/bin/work/20newsbydate/bayestraininput/sci.med.txt >
>> examples/bin/work/20news_ss/bayestraininput/sci.med.txt
>> hdput examples/bin/work/20news_ss/bayestraininput
>> examples/bin/work/20news_ss/bayestraininput
>>
>> then the same exact thing for testing
>>
>> # actual training:
>> bin/mahout trainclassifier -i
>> examples/bin/work/20news_ss/bayestraininput -o
>> examples/bin/work/20newsbydate/cbayesmodel_ss -type cbayes -ng 1
>> -source hdfs
>>
>> # testing
>> bin/mahout testclassifier -d
>> examples/bin/work/20news_ss/bayestestinput -m
>> examples/bin/work/20newsbydate/cbayesmodel_ss -type cbayes -ng 1
>> -source hdfs
>>
>> => 66% accuracy
>>
>> and for Weka:
>> # create the .arff file from 20news_ss train and test:
>> # start the file with the appropriate header:
>>
>> @relation _home_benjamin_Data_BY_weka
>>
>> @attribute text string
>> @attribute class
>> {alt.atheism,comp.graphics,rec.autos,rec.sport.hockey,sci.med}
>>
>> @data
>>
>> # then paste the data:
>> cat ~/workspace/mahout0.5/examples/bin/work/20news_ss/bayestestinput/* |
>> perl mh2arff.pl >> 20news_ss_test.arff
>> # with mh2arff.pl:
>>
>> use strict;
>> while (<STDIN>) {
>>     chomp;
>>     $_ =~ s/\'/\\\'/g;
>>     $_ =~ s/ $//;
>>     my ($c, $t) = split("\t", $_);
>>     print "'$t',$c\n";
>> }
>>
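For anyone without Perl handy, the same transformation sketched in Python (it mirrors the Perl script above, including the quote escaping and single-trailing-space removal; the label/text field order is as produced by Mahout's prepare step):

```python
import sys

def mh2arff_line(line):
    """Mirror of the Perl snippet above: turn one 'label<TAB>text' line
    from the Mahout-prepared files into a quoted ARFF data row."""
    line = line.rstrip("\n")
    line = line.replace("'", "\\'")   # escape single quotes for ARFF strings
    if line.endswith(" "):
        line = line[:-1]              # drop a single trailing space
    fields = line.split("\t")
    label, text = fields[0], fields[1]
    return "'%s',%s" % (text, label)

if __name__ == "__main__":
    for raw in sys.stdin:
        print(mh2arff_line(raw))
```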
>> # and the train/test command:
>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T
>> 20news_ss_test.arff -F
>> "weka.filters.unsupervised.attribute.StringToWordVector -S" -W
>> weka.classifiers.bayes.NaiveBayesMultinomial
>>
>> => 92% accuracy
>>
>>
>>
>>
>>
>>
>> 2011/9/16 Robin Anil <robin.anil@gmail.com>
>>
>>> Did you try complement naive Bayes (CNB)? I am guessing the multinomial
>>> naive Bayes mentioned here is a CNB-like implementation and not plain NB.
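A rough sketch of the CNB idea (per Rennie et al.'s "Tackling the Poor Assumptions of Naive Bayes Text Classifiers"; toy invented counts, not Mahout's actual implementation): each class's term weights are estimated from the *complement* of that class, which damps the pull of heavily populated classes.

```python
import math

# Invented per-class term counts with a big/small skew.
counts = {
    "big":   {"ball": 40, "game": 40, "doctor": 10, "medicine": 0},
    "small": {"ball": 0,  "game": 0,  "doctor": 3,  "medicine": 3},
}
vocab = ["ball", "game", "doctor", "medicine"]

def cnb_weights(counts):
    """Estimate each class's term weights from every OTHER class
    (its complement), with Laplace smoothing."""
    weights = {}
    for c in counts:
        comp = {w: sum(counts[o][w] for o in counts if o != c) for w in vocab}
        total = sum(comp.values())
        weights[c] = {w: math.log((comp[w] + 1.0) / (total + len(vocab)))
                      for w in vocab}
    return weights

def classify(weights, doc):
    # Pick the class whose complement explains the document WORST.
    score = {c: sum(weights[c][w] for w in doc) for c in weights}
    return min(score, key=score.get)

w = cnb_weights(counts)
print(classify(w, ["doctor"]))  # small: complement weighting resists the skew
```

With the same skewed counts that pull a plain multinomial NB toward the big class, CNB assigns the single word "doctor" to the small class.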
>>>
>>>
>>> On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey <benjamin.rey@coptimal.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying out different classifiers on a classical text classification
>>>> problem very close to the 20newsgroups one.
>>>> I end up with much better results with Weka's NaiveBayesMultinomial than
>>>> with Mahout's Bayes.
>>>> The main problem comes from the fact that my data is unbalanced. I know
>>>> Bayes has difficulties with that, yet I'm surprised by the size of the
>>>> difference between Weka and Mahout.
>>>>
>>>> I went back to the 20newsgroups example, picked 5 classes only, and
>>>> subsampled those to get 5 classes with 400, 200, 100, 100 and 30 examples,
>>>> and pretty much the same for the test set.
>>>> On Mahout with Bayes, 1-gram, I'm getting 66% correctly classified (see
>>>> below for the confusion matrix).
>>>> On Weka, on the same exact data, without any tuning, I'm getting 92%
>>>> correctly classified.
>>>>
>>>> Would anyone know where the difference comes from, and whether there are
>>>> ways I could tune Mahout to get better results? My data is small enough
>>>> for Weka for now, but this won't last.
>>>>
>>>> Many thanks
>>>>
>>>> Benjamin.
>>>>
>>>>
>>>>
>>>> MAHOUT:
>>>> -------------------------------------------------------
>>>> Correctly Classified Instances   : 491    65.5541%
>>>> Incorrectly Classified Instances : 258    34.4459%
>>>> Total Classified Instances       : 749
>>>>
>>>> =======================================================
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>> a    b    c    d    e    f    <--Classified as
>>>> 14   82   0    4    0    0  | 100  a = rec.sport.hockey
>>>> 0    319  0    0    0    0  | 319  b = alt.atheism
>>>> 0    88   3    9    0    0  | 100  c = rec.autos
>>>> 0    45   0    155  0    0  | 200  d = comp.graphics
>>>> 0    25   0    5    0    0  | 30   e = sci.med
>>>> 0    0    0    0    0    0  | 0    f = unknown
>>>> Default Category: unknown: 5
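The reported 65.5541% can be checked directly from the matrix: the diagonal entries are the correctly classified counts (transcribing the rows as given above).

```python
# Rows of the Mahout confusion matrix above (classes a..f), columns a..f.
matrix = [
    [14, 82, 0, 4, 0, 0],    # a = rec.sport.hockey
    [0, 319, 0, 0, 0, 0],    # b = alt.atheism
    [0, 88, 3, 9, 0, 0],     # c = rec.autos
    [0, 45, 0, 155, 0, 0],   # d = comp.graphics
    [0, 25, 0, 5, 0, 0],     # e = sci.med
    [0, 0, 0, 0, 0, 0],      # f = unknown
]
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print(correct, total, round(100.0 * correct / total, 4))  # 491 749 65.5541
```

Note that sci.med's diagonal is 0: not a single one of its 30 test documents was predicted correctly, which is the imbalance problem in miniature.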
>>>>
>>>>
>>>> WEKA:
>>>> java weka.classifiers.meta.FilteredClassifier -t 20news_ss_train.arff -T
>>>> 20news_ss_test.arff -F
>>>> "weka.filters.unsupervised.attribute.StringToWordVector -S" -W
>>>> weka.classifiers.bayes.NaiveBayesMultinomial
>>>>
>>>> === Error on test data ===
>>>>
>>>> Correctly Classified Instances 688 91.8558 %
>>>> Incorrectly Classified Instances 61 8.1442 %
>>>> Kappa statistic 0.8836
>>>> Mean absolute error 0.0334
>>>> Root mean squared error 0.1706
>>>> Relative absolute error 11.9863 %
>>>> Root relative squared error 45.151 %
>>>> Total Number of Instances 749
>>>>
>>>>
>>>> === Confusion Matrix ===
>>>>
>>>> a    b    c    d    e    <-- classified as
>>>> 308  9    2    0    0  |  a = alt.atheism
>>>> 5    195  0    0    0  |  b = comp.graphics
>>>> 3    11   84   2    0  |  c = rec.autos
>>>> 3    3    0    94   0  |  d = rec.sport.hockey
>>>> 6    11   6    0    7  |  e = sci.med
>>>>
>>>
>>
