mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Document size rules of thumb
Date Thu, 08 Oct 2009 18:28:31 GMT
With that many classes and so few training examples, you really need to use a
fancier training algorithm.  You also need to have some sense of partial
credit in your scoring.  One way to do that is to sort results and measure
what rank the correct class had on average.  Another way is to produce
probabilities for each class and measure the average log of the probability
of the correct class.

Put another way, suppose that two of your classes are very, very similar and
the nearly correct class is listed first and the correct class is second
(out of 500+ classes).  How bad is that?  On the other hand, suppose that
the two classes were buried 300 deep in the list of likely class
assignments.  Isn't that worse?

Your evaluation should reflect that intuition, especially when you have so
many classes.
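
A minimal sketch of those two measurements in plain Java (not tied to any
Mahout API; it assumes you already have a class -> probability map for each
test document):

import java.util.List;
import java.util.Map;

public class PartialCreditEval {

  /** Rank of the correct class: 1 = highest score, numClasses = lowest. */
  static int rankOfCorrect(Map<String, Double> scores, String correct) {
    double correctScore = scores.get(correct);
    int rank = 1;
    for (double s : scores.values()) {
      if (s > correctScore) {
        rank++;
      }
    }
    return rank;
  }

  /**
   * scoresPerDoc: one class -> probability map per test document.
   * correctLabels: the true class of each document, in the same order.
   */
  static void report(List<Map<String, Double>> scoresPerDoc,
                     List<String> correctLabels) {
    double rankSum = 0.0;
    double logProbSum = 0.0;
    int n = scoresPerDoc.size();
    for (int i = 0; i < n; i++) {
      Map<String, Double> scores = scoresPerDoc.get(i);
      String correct = correctLabels.get(i);
      rankSum += rankOfCorrect(scores, correct);
      // Floor avoids log(0) when the model gives the true class no mass at all.
      logProbSum += Math.log(Math.max(scores.get(correct), 1e-12));
    }
    System.out.printf("mean rank of correct class: %.2f%n", rankSum / n);
    System.out.printf("mean log probability of correct class: %.4f%n", logProbSum / n);
  }
}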

On Thu, Oct 8, 2009 at 6:38 AM, Sandra Clover <sclover@consultant.com> wrote:

> Thanks for the tip Robin - I was wondering what the difference between
> the 2 was but was unable to find anything on them. On this topic, is
> there anything else I should be aware of between the 2 models? Bayes
> algorithm: good for ?? CBayes algorithm: good for multiclass
> classification (categories > 2)
>
>  ----- Original Message -----
>  From: "Robin Anil"
>  To: mahout-user@lucene.apache.org
>  Subject: Re: Document size rules of thumb
>   Date: Thu, 8 Oct 2009 13:39:20 +0530
>
>
>  one more tip: you will get better results with the cbayes algorithm
>  instead of the bayes algorithm for multiclass classification
>  (categories > 2)
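
"cbayes" here is Mahout's complement naive Bayes, based on Rennie et al.
(2003). A rough sketch of the core idea, not the Mahout implementation: a
term's weight for class c is estimated from the documents NOT in c, which is
much less skewed when many of the categories have only a handful of training
documents.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ComplementWeights {

  /**
   * termCountsPerClass: class -> (term -> raw count in that class's training docs).
   * Returns class -> (term -> log weight) where the weight for class c is
   * estimated from every class EXCEPT c, with Laplace smoothing. At
   * classification time the winning class is the one that MINIMISES the sum
   * of these weights over the terms of the document.
   */
  static Map<String, Map<String, Double>> complementWeights(
      Map<String, Map<String, Integer>> termCountsPerClass, Set<String> vocabulary) {

    Map<String, Map<String, Double>> weights = new HashMap<>();
    for (String c : termCountsPerClass.keySet()) {
      // Count every term occurrence outside class c, with +1 smoothing per term.
      double totalOutsideC = vocabulary.size();
      Map<String, Double> countOutsideC = new HashMap<>();
      for (String term : vocabulary) {
        countOutsideC.put(term, 1.0);
      }
      for (Map.Entry<String, Map<String, Integer>> e : termCountsPerClass.entrySet()) {
        if (e.getKey().equals(c)) {
          continue;
        }
        for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
          countOutsideC.merge(t.getKey(), (double) t.getValue(), Double::sum);
          totalOutsideC += t.getValue();
        }
      }
      Map<String, Double> w = new HashMap<>();
      for (String term : vocabulary) {
        w.put(term, Math.log(countOutsideC.get(term) / totalOutsideC));
      }
      weights.put(c, w);
    }
    return weights;
  }
}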
>
>   On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil wrote:
>
>  >
>  >
>  > On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover wrote:
>  >
>  >> Hi Ted, Thanks for the response. To answer your questions: 1. I have
>  >> 576 categories. 2. I started with 5 training documents per category.
>  >> Went up to 10 but error levels remained the same. Am going to go up
>  >> to 30 documents and am going to increase the length of the documents.
>  >> How did you derive the 50 words of training data for some topics?
>  >> Curious... S.
>  >>
>  >>
>  > 30 documents is too few if words overlap across categories and you
>  > don't have enough discriminative words for each category.
>  >
>  > Again, with 576 categories you need really good discriminative words
>  > in each category to be able to cover all the unknown documents you
>  > wish to classify.
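
One quick way to check whether each category actually has discriminative
vocabulary is sketched below. It assumes you can extract the set of training
terms per category (it is not Mahout code); categories whose unique-term
count is near zero are the ones most likely to be swallowed by their
neighbours.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class VocabularyOverlap {

  /** termsPerCategory: category -> set of terms seen in its training documents. */
  static Map<String, Integer> uniqueTermCounts(Map<String, Set<String>> termsPerCategory) {
    // How many categories each term occurs in.
    Map<String, Integer> categoryFrequency = new HashMap<>();
    for (Set<String> terms : termsPerCategory.values()) {
      for (String t : terms) {
        categoryFrequency.merge(t, 1, Integer::sum);
      }
    }
    // For each category, count the terms that appear in no other category.
    Map<String, Integer> unique = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : termsPerCategory.entrySet()) {
      int count = 0;
      for (String t : e.getValue()) {
        if (categoryFrequency.get(t) == 1) {
          count++;
        }
      }
      unique.put(e.getKey(), count);
    }
    return unique;
  }
}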
>  >
>  > ----- Original Message -----
>  >> From: "Ted Dunning"
>  >> To: mahout-user@lucene.apache.org
>  >> Subject: Re: Document size rules of thumb
>  >> Date: Wed, 7 Oct 2009 10:21:20 -0700
>  >>
>  >>
>  >> Sandra,
>  >>
>  >> This is a classic case of over-fitting. I suspect training data
>  >> inadequacy. One thing you don't say is how many categories you have
>  >> and how many training documents per category you have. Your point
>  >> (2) might indicate that you have as little as 50 words of training
>  >> data for some topics. That would make it difficult for even the best
>  >> classifiers to get a sharp result.
>  >>
>  >> I would recommend the following:
>  >>
>  >> a) get more training data (always a good thing even if often
>  >> infeasible)
>  >>
>  >> b) try a few other algorithms. I would recommend trying Luduan (from
>  >> my dissertation, pdf sent to you in a separate email), confidence
>  >> weighted learning (see http://www.cs.jhu.edu/~mdredze/publications/,
>  >> especially http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf)
>  >> and vowpal (http://hunch.net/~vw/)
>  >>
>  >> c) post your data for others to try
>  >>
>  >> Hope this helps.
>  >>
>  >> On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
>  >>
>  >> > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 - I think I am using a
>  >> > branch version. Currently trying to install the trunk version
>  >> >
>  >> > 1. The data I am trying to classify is from scientific papers -
>  >> > essentially the abstract title, text and keywords of their paper -
>  >> > example below
>  >> >
>  >> > 2. No data source is under 300 characters
>  >> >
>  >> > 3. I am training using the Mahout naive Bayes and am getting low
>  >> > incorrectly classified rates, something like 1.67% - I'm quite
>  >> > happy with that…
>  >> >
>  >> > 4. After I have trained the model, Robin, I use the Mahout naive
>  >> > Bayes classify() method to classify new (unseen) data (with the
>  >> > classification already known) - this is where I start to get
>  >> > problems - I get very poor successful classification rates for new
>  >> > data. Something like 82% unsuccessfully classified.
>  >> >
>  >> >
>  >> > To summarise: I get very good results in training and very poor
>  >> > results with new data.
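
A gap like 1.67% error on the training set versus roughly 82% on unseen
documents is the over-fitting signature Ted describes above. One quick
diagnostic, sketched below under the assumption that you keep the true and
predicted labels of the held-out documents as parallel lists, is to break
the accuracy down per category and see whether the failures are spread
evenly across the 576 categories or concentrated in those with the least
training data.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerCategoryAccuracy {

  /** trueLabels and predicted are parallel lists over the held-out documents. */
  static Map<String, Double> accuracyByCategory(List<String> trueLabels,
                                                List<String> predicted) {
    Map<String, Integer> total = new HashMap<>();
    Map<String, Integer> correct = new HashMap<>();
    for (int i = 0; i < trueLabels.size(); i++) {
      String label = trueLabels.get(i);
      total.merge(label, 1, Integer::sum);
      if (label.equals(predicted.get(i))) {
        correct.merge(label, 1, Integer::sum);
      }
    }
    Map<String, Double> accuracy = new HashMap<>();
    for (Map.Entry<String, Integer> e : total.entrySet()) {
      accuracy.put(e.getKey(),
          correct.getOrDefault(e.getKey(), 0) / (double) e.getValue());
    }
    return accuracy;
  }
}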
>  >> >
>  >>
>  >>
>  >>
>  >> --
>  >> Ted Dunning, CTO
>  >> DeepDyve
>  >>
>  >
>
>


-- 
Ted Dunning, CTO
DeepDyve
