mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Document size rules of thumb
Date Thu, 08 Oct 2009 08:07:16 GMT
On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover <sclover@consultant.com> wrote:

> Hi Ted,
>
> Thanks for the response. To answer your questions:
>
> 1. I have 576 categories.
> 2. I started with 5 training documents per category. I went up to 10, but
>    error levels remained the same. I am going to go up to 30 documents and
>    increase the length of the documents.
>
> How did you derive the 50 words of training data for some topics? Curious...
>
> S.
>
>
30 documents is too few if words overlap across categories and you don't
have enough discriminative words for each category.

Again, with 576 categories you need really good discriminative words in each
category to be able to cover all the unknown documents you wish to classify.
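
A minimal sketch of the effect discussed in this thread, assuming a toy
multinomial naive Bayes written from scratch in Python. It is not Mahout's
implementation, and the function names (train_naive_bayes, classify,
accuracy) and the two-category data set are made up for illustration. With
only two short documents per category, the model scores perfectly on the
documents it was trained on but is misled on held-out documents whose words
overlap with the other category:

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Count words per label. docs is a list of (label, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (label, text) pairs that the model labels correctly."""
    correct = sum(1 for label, text in docs if classify(model, text) == label)
    return correct / len(docs)


# Hypothetical toy data: two short training documents per category.
train_docs = [
    ("physics", "quantum entanglement photon experiment results"),
    ("physics", "laser photon measurement study results"),
    ("biology", "protein folding cell experiment results"),
    ("biology", "gene expression cell study results"),
]
# Held-out documents; the last two borrow words from the other category.
held_out_docs = [
    ("physics", "photon detector measurement"),
    ("biology", "cell membrane protein"),
    ("biology", "laser measurement of protein structure"),
    ("physics", "gene expression imaging with laser"),
]

model = train_naive_bayes(train_docs)
print("training accuracy:", accuracy(model, train_docs))
print("held-out accuracy:", accuracy(model, held_out_docs))
```

Measuring on documents that were kept out of training, rather than on the
training set itself, is what exposes the gap. More documents per category
would give the overlapping words less influence relative to the
discriminative ones.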

 ----- Original Message -----
>  From: "Ted Dunning"
>  To: mahout-user@lucene.apache.org
>  Subject: Re: Document size rules of thumb
>   Date: Wed, 7 Oct 2009 10:21:20 -0700
>
>
>  Sandra,
>
>  This is a classic case of over-fitting. I suspect training data
>  inadequacy. One thing you don't say is how many categories you have and
>  how many training documents per category you have. Your point (2) might
>  indicate that you have as little as 50 words of training data for some
>  topics. That would make it difficult for even the best classifiers to get
>  a sharp result.
>
>  I would recommend the following:
>
>  a) get more training data (always a good thing even if often infeasible)
>
>  b) try a few other algorithms. I would recommend trying Luduan (from my
>  dissertation, pdf sent to you in a separate email), confidence weighted
>  learning (see http://www.cs.jhu.edu/~mdredze/publications/, especially
>  http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf) and vowpal
>  (http://hunch.net/~vw/)
>
>  c) post your data for others to try
>
>  Hope this helps.
>
>   On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
>
>  > 0. The setup is Mahout 0.1 & Hadoop 0.19.2. I think I am using a
>  > branch version. Currently trying to install the trunk version.
>  >
>  > 1. The data I am trying to classify is from scientific papers:
>  > essentially the abstract title, text and keywords of the paper
>  > (example below).
>  >
>  > 2. No data source is under 300 characters.
>  >
>  > 3. I am training using the Mahout naive Bayes and am getting a low rate
>  > of incorrectly classified documents, something like 1.67%. I'm quite
>  > happy with that.
>  >
>  > 4. After I have trained the model, Robin, I use the Mahout naive Bayes
>  > classify() method to classify new (unseen) data (with the classification
>  > already known). This is where I start to get problems: I get very poor
>  > classification rates for new data, something like 82% classified
>  > unsuccessfully.
>  >
>  > To summarise: I get very good results in training and very poor results
>  > with new data.
>  >
>
>
>
>  --
>  Ted Dunning, CTO
>  DeepDyve
>
>
>
