mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Document size rules of thumb
Date Thu, 08 Oct 2009 13:52:13 GMT
That's the only difference between the two.
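For anyone wondering what that one difference actually is, below is a minimal sketch in plain Python/numpy (not Mahout code) of the complement naive Bayes idea that CBayes follows: standard Bayes builds each category's term weights from the documents inside that category, while CBayes builds them from every document outside it, which holds up better when there are many categories with only a few examples each. The toy counts, labels and smoothing value are invented purely for illustration.

import numpy as np

def bayes_weights(counts, labels, n_classes, alpha=1.0):
    """Standard Bayes: class c's term weights come from documents IN class c.
    (Class priors are left out to keep the sketch short.)"""
    vocab = counts.shape[1]
    w = np.zeros((n_classes, vocab))
    for c in range(n_classes):
        n_ct = counts[labels == c].sum(axis=0)        # term counts inside class c
        w[c] = np.log((n_ct + alpha) / (n_ct.sum() + alpha * vocab))
    return w

def cbayes_weights(counts, labels, n_classes, alpha=1.0):
    """CBayes (complement naive Bayes): class c's term weights come from the
    documents NOT in class c, which is steadier when each of many classes
    has only a handful of training documents."""
    vocab = counts.shape[1]
    w = np.zeros((n_classes, vocab))
    for c in range(n_classes):
        n_ct = counts[labels != c].sum(axis=0)        # term counts outside class c
        w[c] = -np.log((n_ct + alpha) / (n_ct.sum() + alpha * vocab))  # negated so argmax still works
    return w

# toy data: 4 documents, 5 vocabulary terms, 3 categories (invented numbers)
X = np.array([[3, 0, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [0, 0, 0, 4, 1],
              [0, 1, 0, 0, 3]])
y = np.array([0, 0, 1, 2])

for name, w in (("bayes", bayes_weights(X, y, 3)), ("cbayes", cbayes_weights(X, y, 3))):
    print(name, (X @ w.T).argmax(axis=1))             # predicted category per document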

On Thu, Oct 8, 2009 at 7:08 PM, Sandra Clover <sclover@consultant.com> wrote:

> Thanks for the tip, Robin - I was wondering what the difference between
> the two was but was unable to find anything on them. On this topic, is
> there anything else I should be aware of between the two models?
> Bayes algorithm: good for ??
> CBayes algorithm: good for multiclass classification (categories > 2)
>
>  ----- Original Message -----
>  From: "Robin Anil"
>   To: mahout-user@lucene.apache.org
>  Subject: Re: Document size rules of thumb
>   Date: Thu, 8 Oct 2009 13:39:20 +0530
>
>
>  One more tip: you will get better results with the CBayes algorithm
>  than with the Bayes algorithm for multiclass classification
>  (categories > 2).
>
>   On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil wrote:
>
>  >
>  >
>  > On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover wrote:
>  >
>  >> Hi Ted, thanks for the response. To answer your questions:
>  >> 1. I have 576 categories.
>  >> 2. I started with 5 training documents per category, went up to 10,
>  >> but the error levels remained the same. I am going to go up to 30
>  >> documents and also increase the length of the documents. How did you
>  >> derive the 50 words of training data for some topics? Curious... S.
>  >>
>  >>
>  > 30 documents is too few if words overlap across categories and you
>  > don't have enough discriminative words for each category.
>  >
>  > Again, with 576 categories you need really good discriminative words
>  > in each category to be able to cover all the unknown documents you
>  > wish to classify.
>  >
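A quick way to sanity-check that point is sketched below, in plain Python with nothing Mahout-specific: for each category, count how many of its training terms occur mostly inside that category's own documents. Categories that end up with very few such terms are the ones a classifier is likely to confuse. The tokenizer, the 0.8 dominance threshold, and the toy training sets are assumptions made purely for illustration.

from collections import Counter

def discriminative_terms(docs_by_category, dominance=0.8):
    """For each category, list the terms whose occurrences fall mostly
    (at least `dominance` of the time) inside that category's documents."""
    per_cat = {cat: Counter(tok for doc in docs for tok in doc.lower().split())
               for cat, docs in docs_by_category.items()}
    total = Counter()
    for counts in per_cat.values():
        total.update(counts)
    result = {cat: [] for cat in per_cat}
    for cat, counts in per_cat.items():
        for term, n in counts.items():
            if n / total[term] >= dominance:
                result[cat].append(term)
    return result

# toy per-category training documents (invented); note how the shared word
# "model" is not counted as discriminative for either category
training = {
    "physics": ["quantum spin lattice model", "quantum field theory notes"],
    "biology": ["protein folding pathways", "protein binding site model"],
}
for cat, terms in discriminative_terms(training).items():
    print(cat, len(terms), sorted(terms))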
>  > ----- Original Message -----
>  >> From: "Ted Dunning"
>  >> To: mahout-user@lucene.apache.org
>  >> Subject: Re: Document size rules of thumb
>  >> Date: Wed, 7 Oct 2009 10:21:20 -0700
>  >>
>  >>
>  >> Sandra,
>  >>
>  >> This is a classic case of over-fitting. I suspect training data
>  >> inadequacy. One thing you don't say is how many categories you have
>  >> and how many training documents per category you have. Your point (2)
>  >> might indicate that you have as little as 50 words of training data
>  >> for some topics. That would make it difficult for even the best
>  >> classifiers to get a sharp result.
>  >>
>  >> I would recommend the following:
>  >>
>  >> a) get more training data (always a good thing even if often
>  >> infeasible)
>  >>
>  >> b) try a few other algorithms. I would recommend trying Luduan (from
>  >> my dissertation, pdf sent to you in a separate email), confidence
>  >> weighted learning (see http://www.cs.jhu.edu/~mdredze/publications/,
>  >> especially http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf)
>  >> and vowpal (http://hunch.net/~vw/)
>  >>
>  >> c) post your data for others to try
>  >>
>  >> Hope this helps.
>  >>
>  >> On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
>  >>
>  >> > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
>  >> > branch version. Currently trying to install the trunk version.
>  >> >
>  >> > 1. The data I am trying to classify is from scientific papers -
>  >> > essentially the abstract title, text and keywords of their paper -
>  >> > example below.
>  >> >
>  >> > 2. No data source is under 300 characters.
>  >> >
>  >> > 3. I am training using the Mahout naive Bayes and am getting low
>  >> > incorrect-classification rates, something like 1.67% - I'm quite
>  >> > happy with that…
>  >> >
>  >> > 4. After I have trained the model, Robin, I use the Mahout naive
>  >> > Bayes classify() method to classify new (unseen) data (with the
>  >> > classification already known) - this is where I start to get
>  >> > problems - I get very poor successful classification rates for new
>  >> > data. Something like 82% unsuccessfully classified.
>  >> >
>  >> > To summarise: I get very good results in training and very poor
>  >> > results with new data.
>  >> >
>  >>
>  >>
>  >>
>  >> --
>  >> Ted Dunning, CTO
>  >> DeepDyve
>  >>
>  >>
>  >
>
>
>
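To put Ted's diagnosis in concrete terms (the thread above reports roughly 1.67% error on the training documents but about 82% of new documents misclassified), here is a minimal sketch of the evaluation pattern that catches the problem: score the model on documents held out of training, because the training-set number can look excellent even when the model generalises badly. The toy vocabulary-overlap classifier and the tiny corpus below are invented for illustration and are not Mahout code.

import random

def tokenize(doc):
    return set(doc.lower().split())

def train(labelled_docs):
    """'Model' = the union of training vocabulary per category."""
    model = {}
    for label, doc in labelled_docs:
        model.setdefault(label, set()).update(tokenize(doc))
    return model

def classify(model, doc):
    words = tokenize(doc)
    return max(model, key=lambda label: len(model[label] & words))

def accuracy(model, labelled_docs):
    hits = sum(classify(model, doc) == label for label, doc in labelled_docs)
    return hits / len(labelled_docs)

# toy corpus: (category, abstract-like text)
corpus = [
    ("astro", "dark matter halo simulation"), ("astro", "galaxy rotation curve survey"),
    ("astro", "supernova light curve fit"),   ("bio", "protein folding energy landscape"),
    ("bio", "gene expression in yeast"),      ("bio", "protein binding site prediction"),
]
random.seed(0)
random.shuffle(corpus)
train_set, held_out = corpus[:4], corpus[4:]

model = train(train_set)
print("training accuracy:", accuracy(model, train_set))   # usually looks very good
print("held-out accuracy:", accuracy(model, held_out))    # this is the number that matters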
