mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Document size rules of thumb
Date Thu, 08 Oct 2009 08:09:20 GMT
One more tip: you will get better results with the cbayes algorithm than with
the bayes algorithm for multiclass classification (more than two categories).
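
The rough intuition, in case it helps: cbayes estimates each category's term
weights from the documents of all the *other* categories (the complement),
which is more stable when individual categories have very little training
data. Below is a small, untested sketch of that weighting idea - illustrative
only, not the Mahout implementation; the count maps and the smoothing value
alpha are assumptions of mine.

import java.util.HashMap;
import java.util.Map;

/** Illustrative complement-weight computation (the CNB idea), not Mahout's code. */
public class CBayesSketch {

  /** termCounts: category -> (term -> count of that term in the category's training docs). */
  static Map<String, Map<String, Double>> complementWeights(
      Map<String, Map<String, Integer>> termCounts, double alpha) {

    Map<String, Integer> globalCount = new HashMap<>();
    long globalTotal = 0;
    for (Map<String, Integer> counts : termCounts.values()) {
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        globalCount.merge(e.getKey(), e.getValue(), Integer::sum);
        globalTotal += e.getValue();
      }
    }
    int vocabSize = globalCount.size();

    Map<String, Map<String, Double>> weights = new HashMap<>();
    for (Map.Entry<String, Map<String, Integer>> cat : termCounts.entrySet()) {
      long inCategoryTotal = 0;
      for (int c : cat.getValue().values()) {
        inCategoryTotal += c;
      }
      long complementTotal = globalTotal - inCategoryTotal;

      Map<String, Double> w = new HashMap<>();
      for (Map.Entry<String, Integer> t : globalCount.entrySet()) {
        int outside = t.getValue() - cat.getValue().getOrDefault(t.getKey(), 0);
        // The weight is estimated from the complement: how common the term is
        // outside this category. A low weight is strong evidence for the category.
        w.put(t.getKey(),
            Math.log((outside + alpha) / (complementTotal + alpha * vocabSize)));
      }
      weights.put(cat.getKey(), w);
    }
    return weights;
  }
}

To classify, you would score each category as the negative sum, over the
document's terms, of term frequency times weight, and take the highest score.
Plain bayes uses the in-category counts directly, which is exactly where tiny
categories hurt.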

On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil <robin.anil@gmail.com> wrote:

>
>
> On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover <sclover@consultant.com> wrote:
>
>> Hi Ted,
>>
>> Thanks for the response. To answer your questions:
>>
>> 1. I have 576 categories.
>> 2. I started with 5 training documents per category. Went up to 10 but error
>> levels remained the same. I am going to go up to 30 documents and am going to
>> increase the length of the documents.
>>
>> How did you derive the 50 words of training data for some topics? Curious... S.
>>
>>
> 30 documents is too few if words overlap across categories and you don't
> have enough discriminative words for each category.
>
> Again, with 576 categories you need really good discriminative words in each
> category to be able to cover all the unknown documents you wish to classify.
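>
> A quick way to sanity-check that is to score every term by how concentrated
> its occurrences are in a single category, and then look at how many strong
> terms each of the 576 categories really has. A rough, untested sketch of
> such a check - the frequency-share score and the data structures are my own
> assumptions for illustration, nothing from Mahout:
>
> import java.util.ArrayList;
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> /** Lists, per category, the terms most concentrated in that category. */
> public class DiscriminativeTerms {
>
>   /** docs: category -> list of tokenized training documents. */
>   static Map<String, List<String>> topTerms(Map<String, List<List<String>>> docs, int topK) {
>     Map<String, Map<String, Integer>> perCategory = new HashMap<>();
>     Map<String, Integer> overall = new HashMap<>();
>     for (Map.Entry<String, List<List<String>>> e : docs.entrySet()) {
>       Map<String, Integer> counts = new HashMap<>();
>       for (List<String> doc : e.getValue()) {
>         for (String term : doc) {
>           counts.merge(term, 1, Integer::sum);
>           overall.merge(term, 1, Integer::sum);
>         }
>       }
>       perCategory.put(e.getKey(), counts);
>     }
>
>     Map<String, List<String>> result = new HashMap<>();
>     for (Map.Entry<String, Map<String, Integer>> e : perCategory.entrySet()) {
>       Map<String, Integer> counts = e.getValue();
>       List<String> terms = new ArrayList<>(counts.keySet());
>       // Score = share of a term's total occurrences that fall in this category.
>       // (A real check should also demand a minimum count so rare terms don't win.)
>       terms.sort(Comparator.comparingDouble(t -> -counts.get(t) / (double) overall.get(t)));
>       result.put(e.getKey(), terms.subList(0, Math.min(topK, terms.size())));
>     }
>     return result;
>   }
> }
>
> If many categories come back with only generic words at the top, more (or
> longer) training documents is the fix, not a different classifier.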
>
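> (By the way, the 50 words Ted mentions below is probably just your
> 300-character minimum divided by roughly 6 characters per English word,
> i.e. 300 / 6 = 50 words per document - only my guess at his arithmetic.)
>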
>>  ----- Original Message -----
>>  From: "Ted Dunning"
>>  To: mahout-user@lucene.apache.org
>>  Subject: Re: Document size rules of thumb
>>   Date: Wed, 7 Oct 2009 10:21:20 -0700
>>
>>
>>  Sandra,
>>
>>  This is a classic case of over-fitting. I suspect training data
>>  inadequacy. One thing you don't say is how many categories you have and
>>  how many training documents per category you have. Your point (2) might
>>  indicate that you have as little as 50 words of training data for some
>>  topics. That would make it difficult for even the best classifiers to get
>>  a sharp result.
>>
>>  I would recommend the following:
>>
>>  a) get more training data (always a good thing even if often infeasible)
>>
>>  b) try a few other algorithms. I would recommend trying Luduan (from my
>>  dissertation, pdf sent to you in a separate email), confidence-weighted
>>  learning (see http://www.cs.jhu.edu/~mdredze/publications/, especially
>>  http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf) and vowpal
>>  (http://hunch.net/~vw/)
>>
>>  c) post your data for others to try
>>
>>  Hope this helps.
>>
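>>  P.S. To put numbers on the over-fitting, hold some labeled documents
>>  completely out of training and score only on those. A very rough sketch of
>>  such a check follows; the Classifier interface here is hypothetical, a
>>  stand-in for whatever classifier you actually train, not the Mahout API.
>>
>>  import java.util.ArrayList;
>>  import java.util.Collections;
>>  import java.util.List;
>>  import java.util.Random;
>>
>>  /** Compares training accuracy with accuracy on a held-out split. */
>>  public class HeldOutCheck {
>>
>>    /** Hypothetical stand-in for any trainable classifier. */
>>    interface Classifier {
>>      void train(List<LabeledDoc> docs);
>>      String classify(String text);
>>    }
>>
>>    static class LabeledDoc {
>>      final String text;
>>      final String label;
>>      LabeledDoc(String text, String label) { this.text = text; this.label = label; }
>>    }
>>
>>    static void report(Classifier classifier, List<LabeledDoc> all) {
>>      List<LabeledDoc> shuffled = new ArrayList<>(all);
>>      Collections.shuffle(shuffled, new Random(42));
>>      int cut = (int) (shuffled.size() * 0.8);   // 80% train, 20% held out
>>      List<LabeledDoc> train = shuffled.subList(0, cut);
>>      List<LabeledDoc> heldOut = shuffled.subList(cut, shuffled.size());
>>
>>      classifier.train(train);
>>      System.out.printf("train accuracy:    %.1f%%%n", 100 * accuracy(classifier, train));
>>      System.out.printf("held-out accuracy: %.1f%%%n", 100 * accuracy(classifier, heldOut));
>>    }
>>
>>    static double accuracy(Classifier classifier, List<LabeledDoc> docs) {
>>      int correct = 0;
>>      for (LabeledDoc d : docs) {
>>        if (d.label.equals(classifier.classify(d.text))) {
>>          correct++;
>>        }
>>      }
>>      return docs.isEmpty() ? 0.0 : (double) correct / docs.size();
>>    }
>>  }
>>
>>  A large gap between the two numbers is exactly the pattern you describe;
>>  per (a), watch the held-out number as you add training data.
>>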
>>   On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
>>
>>  > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
>>  > branch version. Currently trying to install the trunk version.
>>  >
>>  > 1. The data I am trying to classify is from scientific papers -
>>  > essentially the abstract title, text and keywords of each paper -
>>  > example below
>>  >
>>  > 2. No data source is under 300 characters
>>  >
>>  > 3. I am training using the Mahout naive Bayes and am getting a low rate
>>  > of incorrectly classified documents, something like 1.67% - I’m quite
>>  > happy with that…
>>  >
>>  > 4. After I have trained the model, Robin, I use the Mahout naive Bayes
>>  > classify() method to classify new (unseen) data (with the classification
>>  > already known) - this is where I start to get problems - I get very poor
>>  > classification accuracy on new data, something like 82% unsuccessfully
>>  > classified.
>>  >
>>  >
>>  >
>>  > To summarise: I get very good results in training and very poor results
>>  > with new data.
>>  >
>>
>>
>>
>>  --
>>  Ted Dunning, CTO
>>  DeepDyve
>>
>>
>
