mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 20:05:19 GMT
The model size is much smaller with unigrams.  :-)
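
As a rough illustration of why gram size drives model size, here is a toy Python sketch that counts distinct n-grams for n = 1, 2 and 3. It is not Mahout code: the sample sentence and the helper name distinct_ngrams are made up for illustration. On a real corpus the gap between unigrams and trigrams is typically far larger than on this tiny sample.

```python
def distinct_ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical sample text, whitespace-tokenized for simplicity.
text = "the cat sat on the mat and the cat ate the fish on the mat"
tokens = text.split()

# Each distinct n-gram is one feature the model would have to store.
for n in (1, 2, 3):
    print(f"gramSize={n}: {len(distinct_ngrams(tokens, n))} distinct features")
```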

I'm not quite sure what constitutes good just yet, but, I can report  
the following using the commands I reported earlier w/ the exception  
that I am using unigrams:

I have two categories:  History and Science

0. Splitter:
--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir /PATH/wikipedia/chunks -c 64

Then prep:
--input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/
(also do this for the training set)

1. Train set:
ls ../chunks
chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml   
chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml   
chunk-0033.xml  chunk-0037.xml
chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml   
chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml   
chunk-0034.xml  chunk-0038.xml
chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml   
chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml   
chunk-0035.xml  chunk-0039.xml
chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml   
chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml   

2. Test Set:
chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml   
chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml   
chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml

3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes

4. Run the TestClassifier.

--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes

Output is:

09/07/22 15:55:09 INFO bayes.TestClassifier:
Correctly Classified Instances          :       4143	   74.0615%
Incorrectly Classified Instances        :       1451	   25.9385%
Total Classified Instances              :       5594

Confusion Matrix
a    	b    	<--Classified as
3910 	186  	 |  4096  	a     = history
1265 	233  	 |  1498  	b     = science
Default Category: unknown: 2

At least it's better than 50%, which is presumably a good thing ;-)  I  
have no clue what the state of the art is these days, but it doesn't  
seem _horrendous_ either.
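
One way to put the 74% in context is to recompute the summary numbers from the confusion matrix and compare them with a majority-class baseline. The Python sketch below does that, assuming each row of the matrix is the true category and each column is the predicted one (my reading of the "<--Classified as" header, not something confirmed by the output). Under that reading, always answering "history" would score about 73.2%, and only about 15.6% of the science documents are labeled correctly, so per-category recall may be a more telling number than overall accuracy.

```python
# Confusion matrix copied from the TestClassifier output above.
# Assumption: outer key = true category, inner key = predicted category.
matrix = {
    "history": {"history": 3910, "science": 186},
    "science": {"history": 1265, "science": 233},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[label][label] for label in matrix)
accuracy = correct / total

# Baseline: always predict whichever true category is most common.
majority_baseline = max(sum(row.values()) for row in matrix.values()) / total

# Recall: fraction of each category's documents that were labeled correctly.
recall = {
    label: matrix[label][label] / sum(matrix[label].values())
    for label in matrix
}

print(f"accuracy          : {accuracy:.4f}")
print(f"majority baseline : {majority_baseline:.4f}")
for label, value in recall.items():
    print(f"recall({label}) : {value:.4f}")
```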

I'd love to see someone validate what I have done.  Let me know if you  
need more details.  I'd also like to know how I can improve it.

On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:

> Indeed.  I hadn't snapped to the fact you were using trigrams.
>
> 30 million features is quite plausible for that.  To effectively use long
> n-grams as features in classification of documents you really need to have
> the following:
>
> a) good statistical methods for resolving what is useful and what is not.
>    Everybody here knows that my preference for a first hack is sparsification
>    with log-likelihood ratios.
> b) some kind of smoothing using smaller n-grams
> c) some kind of smoothing over variants of n-grams.
>
> AFAIK, mahout doesn't have many or any of these in place.  You are likely to
> do better with unigrams as a result.
>
> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <> wrote:
>
>> I suspect the explosion in the number of features, Ted, is due to the use
>> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1, that
>> will likely reduce the feature set quite a bit.
> -- 
> Ted Dunning, CTO
> DeepDyve
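
For reference, the log-likelihood ratio sparsification mentioned in (a) of the quoted message can be sketched as below. This is a minimal Python version of the G^2 statistic for a 2x2 contingency table, not Mahout code; the function name llr, the count layout, and the example counts are assumptions chosen for illustration. The idea would be to score each candidate n-gram against the category labels and keep only the features whose score exceeds some threshold.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    Assumed layout (hypothetical, for feature selection):
      k11 = history docs containing the term, k12 = science docs containing it,
      k21 = history docs without the term,    k22 = science docs without it.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # treat 0 * log(0) as 0
            # observed count times log(observed / expected under independence)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


# A term spread evenly across both categories carries no signal.
print(llr(10, 10, 10, 10))
# A term concentrated in one category scores much higher (made-up counts).
print(llr(90, 5, 910, 995))
```

A larger score means the term's presence is more strongly associated with one category than chance would suggest, so low-scoring n-grams are the natural candidates to drop.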
