mahout-user mailing list archives

From Ted Dunning
Subject Re: Document size rules of thumb
Date Wed, 07 Oct 2009 17:21:20 GMT

This is a classic case of over-fitting.  I suspect the training data is
inadequate.  One thing you don't say is how many categories you have and how
many training documents you have per category.  Your point (2) might
indicate that you have as little as 50 words of training data for some
topics.  That would make it difficult for even the best classifiers to get a
sharp result.
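
A quick way to answer the question above is to count documents and words
per category before training.  The following is only a minimal sketch in
Python, not anything from Mahout: the function name and the assumption that
the training set is available as (category, text) pairs are made up for
illustration, and words are split on whitespace only.

```python
from collections import defaultdict


def training_data_summary(labeled_docs):
    """Return {category: (num_docs, num_words)} for (category, text) pairs.

    Hypothetical helper: assumes whitespace-separated words, which is
    cruder than whatever tokenizer the classifier itself uses.
    """
    doc_counts = defaultdict(int)
    word_counts = defaultdict(int)
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts[category] += len(text.split())
    return {c: (doc_counts[c], word_counts[c]) for c in doc_counts}


if __name__ == "__main__":
    sample = [
        ("physics", "quantum spin chains"),
        ("physics", "lattice models"),
        ("biology", "protein folding"),
    ]
    # List the thinnest categories first, since those are the likely trouble spots.
    summary = training_data_summary(sample)
    for category, (n_docs, n_words) in sorted(summary.items(), key=lambda kv: kv[1][1]):
        print(category, n_docs, n_words)
```

Categories that show up with only a handful of documents or a few dozen
words are the ones to worry about first.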

I would recommend the following:

a) get more training data (always a good thing even if often infeasible)

b) try a few other algorithms.  I would recommend trying Luduan (from my
dissertation, pdf sent to you in a separate email), confidence weighted
learning (see, especially and vowpal (

c) post your data for others to try
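
Whichever of these you try, it helps to measure the error on documents the
model has never seen, and to compare it with the training error.  The sketch
below is one way to do that under some assumptions: it is plain Python, not
the Mahout API, the names split_holdout and error_rate are hypothetical, and
classify stands for whatever function maps a document's text to a predicted
category.

```python
import random


def split_holdout(labeled_docs, holdout_fraction=0.2, seed=0):
    """Shuffle (category, text) pairs and return (train, holdout) lists."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_holdout = max(1, int(len(docs) * holdout_fraction))
    return docs[n_holdout:], docs[:n_holdout]


def error_rate(classify, labeled_docs):
    """Fraction of (category, text) pairs that classify() gets wrong."""
    docs = list(labeled_docs)
    if not docs:
        raise ValueError("need at least one labeled document")
    wrong = sum(1 for category, text in docs if classify(text) != category)
    return wrong / len(docs)


if __name__ == "__main__":
    docs = [("a", "x")] * 8 + [("b", "y")] * 2
    train, holdout = split_holdout(docs)

    # Stand-in classifier that always answers "a"; replace it with a model
    # trained on `train` only.
    def always_a(text):
        return "a"

    print("training error:", error_rate(always_a, train))
    print("held-out error:", error_rate(always_a, holdout))
```

A training error far below the held-out error is the over-fitting signature
described above.  The split here is purely random, so with very small
categories some of them may end up with no held-out documents at all.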

Hope this helps.

On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:

> 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
> branch version. Currently trying to install the trunk version
> 1. The data I am trying to classify is from scientific papers -
> essentially the abstract title, text and keywords of the paper -
> example below
> 2. No data source is under 300 characters
> 3. I am training using the Mahout naive Bayes and am getting a low
> rate of incorrectly classified documents, something like 1.67% - I’m
> quite happy with that…
> 4. After I have trained the model Robin I use the Mahout naive Bayes
> classify() method to classify new (unseen) data (with the classification
> already known) - this is where I start to get problems - I get very poor
> classification rates for new data, something like 82% incorrectly
> classified.
> To summarise: I get very good results in training and very poor results
> with new data.

Ted Dunning, CTO
