mahout-user mailing list archives

From "Philippe Lamarche" <philippe.lamar...@gmail.com>
Subject Re: Problems with the Bayesian classifiers.
Date Mon, 21 Jul 2008 23:42:08 GMT
 Hi,

I just tried it with Mallet.

http://mallet.cs.umass.edu/index.php/Main_Page

I used the same training and testing files (on the 20News corpus) and
got an 85% prediction accuracy.

However, I also tried it on Mallet with my usual Enron corpus and only
got 50% accuracy.

Since Mallet does well on the same 20News files, I would say that there
is probably something wrong with the Mahout classifier implementation.
The 50% result also suggests that the training data I use with the
Enron data set is probably not distinct enough for a Bayesian
classifier.
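
To check whether a classifier's output shows the pattern reported
earlier in this thread (every instance of one class predicted as a
single other class), the confusion matrix can be inspected directly.
Below is a minimal Python sketch, not Mahout or Mallet code. The
function names, the 0.9 threshold, and the toy counts (apart from the
421 "rec.motorcycles" instances from the original report) are
assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# confusion[true_label][predicted_label] = number of test instances
Confusion = Dict[str, Dict[str, int]]


def accuracy(confusion: Confusion) -> float:
    """Fraction of instances whose predicted label equals the true label."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return correct / total if total else 0.0


def collapsed_classes(
    confusion: Confusion, threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """List (true_label, predicted_label, share) for each class whose
    instances are mostly assigned to one *other* class.

    The threshold is an arbitrary illustrative cut-off.
    """
    flagged = []
    for true_label, row in confusion.items():
        n = sum(row.values())
        if n == 0:
            continue
        top_label, top_count = max(row.items(), key=lambda kv: kv[1])
        share = top_count / n
        if top_label != true_label and share >= threshold:
            flagged.append((true_label, top_label, share))
    return flagged


# Hypothetical toy matrix: only the 421 count comes from the thread.
toy = {
    "rec.motorcycles": {"talk.politics.mideast": 421},
    "talk.politics.mideast": {"talk.politics.mideast": 380, "rec.motorcycles": 20},
}

print(accuracy(toy))
print(collapsed_classes(toy))
```

A class showing up in the second list while a reference tool such as
Mallet classifies the same files correctly would point at the
implementation rather than at the data.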

Any ideas?

Thanks,

Philippe.


On Sun, Jul 20, 2008 at 11:23 AM, Philippe Lamarche
<philippe.lamarche@gmail.com> wrote:
>  Hi,
>
> I uploaded my split here:
>
> http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html
>
> (the download link is after all the ads, at the bottom of the page)
>
> The file contains the "news_test_1" and "news_train_1" folders, with
> the original file/folder structure. The "news_ha_train_1" folder
> contains the collapsed version of "news_train_1".
>
> The training files are not evenly distributed across classes (some
> classes contain fewer training files than others). This was done to
> reflect the UC Berkeley Enron corpus.
>
> Thanks,
> Philippe.
>
>
> On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>> I haven't done a lot of testing w/ M-9 yet, so it is more than likely there
>> are bugs ;-)
>>
>> -Grant
>>
>> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
>>
>>> i think it would also be useful to cross-check your results against a text
>>> classification system which is known to work.  look at rainbow:
>>>
>>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
>>>
>>> if you get the correct results here then either you have somehow messed-up
>>> with Mahout or else there really is a bug
>>>
>>> Miles
>>>
>>> 2008/7/20 Robin Anil <robin.anil@gmail.com>:
>>>
>>>> Can you upload your split somewhere?
>>>>
>>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
>>>> philippe.lamarche@gmail.com> wrote:
>>>>
>>>>> Now, with the attachment.
>>>>> Sorry.
>>>>>
>>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
>>>>> <philippe.lamarche@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have been working for a little while with Mahout and the Bayesian
>>>>>> classifier for a school project.
>>>>>>
>>>>>> I am using the Enron email corpus and the UC Berkeley classified
>>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
>>>>>> seem to make it work. I wonder if I am doing something wrong.
>>>>>>
>>>>>> For example, I am getting under 10% correct predictions with Bayes
>>>>>> and around 1% with CBayes. The problem seems to be that all
>>>>>> instances of a class get predicted as another class, or that they
>>>>>> all get predicted as the class containing the most features.
>>>>>>
>>>>>> I also tested with the 20News corpus and I get a similar result, where
>>>>>> all instances of a class are predicted as another class (e.g. all
>>>>>> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
>>>>>> Attached are two confusion matrices displaying results for bayes and
>>>>>> cbayes. Both used the same division into training and testing sets.
>>>>>>
>>>>>> Am I doing something wrong?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Philippe Lamarche.
>>>>>>
>>>>>
>>>>
>>>>
>>>> Thanks
>>>> Robin
>>>>
>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>> with registration number SC005336.
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>
