mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 22:33:09 GMT
<done_basking>Grant</done_basking>

Here's an interesting piece:
09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/subjects/prepared-test/history.txt
09/07/22 18:23:07 INFO bayes.TestClassifier: history	95.458984375	3910/4096.0
09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/subjects/prepared-test/science.txt
09/07/22 18:23:08 INFO bayes.TestClassifier: science	15.554072096128172	233/1498.0
09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================


In other words, I'm really good at predicting History as a category  
and really bad at predicting Science.
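
To put numbers on that, here is a minimal Python sketch that recomputes
the per-category results from the counts in the confusion matrix quoted
further down. The function names are illustrative and are not part of
Mahout; the "Default Category: unknown" count is ignored.

```python
# Counts copied from the confusion matrix quoted later in this thread.
# Outer key = true category, inner key = category the classifier chose.
confusion = {
    "history": {"history": 3910, "science": 186},
    "science": {"history": 1265, "science": 233},
}


def per_class_recall(matrix):
    """Fraction of each true category's documents that were labeled correctly."""
    return {label: row[label] / sum(row.values()) for label, row in matrix.items()}


def overall_accuracy(matrix):
    """Correctly labeled documents divided by all documents."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def majority_baseline(matrix):
    """Accuracy of a classifier that always answers with the largest category."""
    sizes = [sum(row.values()) for row in matrix.values()]
    return max(sizes) / sum(sizes)


recall = per_class_recall(confusion)
print("per-category recall:", recall)
print("overall accuracy:   ", overall_accuracy(confusion))
print("majority baseline:  ", majority_baseline(confusion))
```

On these counts the overall accuracy is about 74.1%, while always
answering "history" would score about 73.2%, so nearly all of the
apparent accuracy comes from the larger category.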

I think the following might help explain why:
ls -l
total 245360
-rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
-rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*

By file size, my test set has roughly two and a half times as much
history data as science data, and the document counts above are
similarly skewed (4096 history versus 1498 science).
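
One common response to this kind of skew is random undersampling:
trimming every category down to the size of the smallest one before
training or testing. The sketch below shows the idea in Python on toy
stand-in data. It is an assumption that this would help here, nothing
in this thread confirms it, and the `undersample` helper is
hypothetical rather than anything Mahout provides.

```python
import random


def undersample(examples_by_label, seed=0):
    """Randomly trim every category down to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    target = min(len(docs) for docs in examples_by_label.values())
    return {
        label: rng.sample(docs, target)  # sample without replacement
        for label, docs in examples_by_label.items()
    }


# Toy stand-in documents with the same skew as the counts in the log above.
toy = {
    "history": [f"history-doc-{i}" for i in range(4096)],
    "science": [f"science-doc-{i}" for i in range(1498)],
}

balanced = undersample(toy)
print({label: len(docs) for label, docs in balanced.items()})
```

The trade-off is that undersampling throws away examples from the
larger category, so it is worth comparing against simply collecting
more examples for the smaller one.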

There is obviously a teaching moment here.  I know there is a lot out
there about sample sizes, feature selection, etc.  Can we boil some of
it down into some cogent recommendations for our users?


-Grant

On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:

> <basking>Grant</basking>
>
> On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
>
>> Getting something to run is a big step.  It is important to bask in
>> the glow for a tiny moment.
>>
>> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll
>> <gsingers@apache.org> wrote:
>>
>>> Confusion Matrix
>>> -------------------------------------------------------
>>> a       b       <--Classified as
>>> 3910    186      |  4096        a     = history
>>> 1265    233      |  1498        b     = science
>>> Default Category: unknown: 2
>>> </snip>
>>>
>>> At least it's better than 50%, which is presumably a good thing ;-)
>>> I have no clue what the state of the art is these days, but it
>>> doesn't seem _horrendous_ either.
>>>
>>> I'd love to see someone validate what I have done.  Let me know if
>>> you need more details.  I'd also like to know how I can improve it.
>>>
>>
>>
>>
>> -- 
>> Ted Dunning, CTO
>> DeepDyve
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

