mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Getting Started with Classification
Date Thu, 23 Jul 2009 20:48:42 GMT
More info:

Classifying docs (same train/test set) as "Republicans" or "Democrats" yields:
[java] Summary
      [java] -------------------------------------------------------
      [java] Correctly Classified Instances          :         56           76.7123%
      [java] Incorrectly Classified Instances        :         17           23.2877%
      [java] Total Classified Instances              :         73
      [java]
      [java] =======================================================
      [java] Confusion Matrix
      [java] -------------------------------------------------------
      [java] a           b       <--Classified as
      [java] 21          9        |  30          a     = democrats
      [java] 8           35       |  43          b     = republicans
      [java] Default Category: unknown: 2
      [java]

For these, the training data was roughly equal in size (both about  
1.5MB). On the test set I got about 81% right for Republicans and 70%  
for the Democrats (does this imply Repubs do a better job of sticking  
to message on Wikipedia than Dems? :-) ). It would be interesting to  
train on a larger set.
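For anyone who wants to recompute those summary numbers, here's a quick sketch (plain Python, not Mahout code) that derives the overall and per-class accuracy from the confusion matrix above:

```python
# Confusion matrix from the run above:
# outer key = actual class, inner key = predicted class
matrix = {
    "democrats":   {"democrats": 21, "republicans": 9},
    "republicans": {"democrats": 8,  "republicans": 35},
}

total = sum(sum(row.values()) for row in matrix.values())    # 73
correct = sum(matrix[c][c] for c in matrix)                  # 21 + 35 = 56

print("Correctly classified: %d (%.4f%%)" % (correct, 100.0 * correct / total))
for c, row in matrix.items():
    # per-class accuracy (recall): right answers / actual instances of that class
    print("%s: %.1f%%" % (c, 100.0 * row[c] / sum(row.values())))
```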

-Grant

On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:

> Did you try CBayes? It's supposed to counteract the class imbalance
> effect to some extent.
>
>
>
> On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning<ted.dunning@gmail.com>  
> wrote:
>> Some learning algorithms deal with this better than others.  The  
>> problem is
>> particularly bad in information retrieval (negative examples  
>> include almost
>> the entire corpus, positives are a tiny fraction) and fraud (less  
>> than 1% of
>> the training data is typically fraud).
>>
>> Down-sampling the over-represented case is the simplest answer  
>> where you have lots of data.  It doesn't help much to have more  
>> than 3x more data for one case than for the other anyway (at  
>> least in binary decisions).
>>
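Ted's down-sampling suggestion can be sketched like this (illustrative Python, not the Mahout API): the over-represented class is randomly capped at 3x the size of the smallest class.

```python
import random

def downsample(examples, labels, max_ratio=3, seed=0):
    """Cap each over-represented class at max_ratio times the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    smallest = min(len(xs) for xs in by_class.values())
    cap = max_ratio * smallest
    out = []
    for y, xs in by_class.items():
        # keep the class as-is if small enough, otherwise sample down to the cap
        keep = xs if len(xs) <= cap else rng.sample(xs, cap)
        out.extend((x, y) for x in keep)
    rng.shuffle(out)
    return out
```

For example, 1000 negatives and 100 positives come out as 300 negatives and 100 positives.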
>> Another aspect of this is the cost of different errors.  For  
>> instance, in fraud, verifying a transaction with a customer has a  
>> low (but non-zero) cost, while failing to detect a fraud in  
>> progress can be very, very bad.  False negatives are thus more of  
>> a problem than false positives, and the models are tuned  
>> accordingly.
>>
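Tuning for asymmetric costs amounts to lowering the decision threshold on the fraud score. A minimal sketch (the dollar costs are made up for illustration): flag a transaction whenever the expected cost of missing a fraud exceeds the expected cost of a needless verification call.

```python
def flag_threshold(cost_false_positive, cost_false_negative):
    """Fraud probability above which flagging minimizes expected cost.

    Expected cost of flagging:     (1 - p) * cost_false_positive
    Expected cost of not flagging: p * cost_false_negative
    Flag when p * c_fn > (1 - p) * c_fp, i.e. p > c_fp / (c_fp + c_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Verifying with a customer is cheap (say $2); missing a fraud is very bad ($500):
t = flag_threshold(2.0, 500.0)
# => flag anything scored above roughly 0.4% fraud probability
```

With symmetric costs the threshold falls back to the familiar 0.5.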
>> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <miles@inf.ed.ac.uk>  
>> wrote:
>>
>>> this is the class imbalance problem (i.e. you have many more  
>>> instances of one class than of the other).
>>>
>>> in this case, you could ensure that the training set was balanced  
>>> (50:50); more interestingly, you can have a prior which corrects  
>>> for this.  or, you could over-sample or even under-sample the  
>>> training set, etc.
>>>
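Miles's "prior which corrects for this" can be illustrated with a naive-Bayes-style prior term (a sketch, not Mahout's implementation): swap the empirical class prior, which reflects the skewed training counts, for a uniform one so the likelihoods alone decide.

```python
import math

def class_log_priors(class_counts, uniform=False):
    """Log-priors from training counts; uniform=True removes the imbalance."""
    if uniform:
        p = 1.0 / len(class_counts)
        return {c: math.log(p) for c in class_counts}
    total = sum(class_counts.values())
    return {c: math.log(n / total) for c, n in class_counts.items()}

counts = {"spam": 9000, "ham": 1000}   # 9:1 imbalance in the training data
skewed  = class_log_priors(counts)
uniform = class_log_priors(counts, uniform=True)
# with the skewed prior, "spam" starts every decision with a log(9) head start;
# with the uniform prior, both classes start even.
```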
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

