mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 20:20:32 GMT

On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:

> It is probably good to benchmark against standard datasets.  For text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> This way you know if you are doing a good job.

Yeah, good point.  Only problem is, for my demo, I am doing it all on
Wikipedia, because I want coherent examples and don't want to have to
introduce another dataset.  I know there are a few areas for error in
the process: we are picking a single category for a document even
though documents can have multiple, and we are picking the first
category that matches even though multiple input categories might be
present, or even both combined in one (e.g. History of Science).

Still, good to try out w/ the Reuters collection as well.  Sigh, I'll
put it on the to-do list.
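To make the first-match caveat above concrete, here is a hypothetical sketch of single-label, first-match assignment (the function and the substring-matching rule are my illustration, not the actual WikipediaDatasetCreatorDriver logic):

```python
def label_document(page_categories, subjects):
    """Return the first subject matching any page category, or None.

    A page with several categories gets exactly one label -- the first
    subject that matches -- so the remaining categories are discarded.
    """
    for cat in page_categories:
        for subject in subjects:
            if subject.lower() in cat.lower():
                return subject
    return None

# A page tagged ["History of Science", "Philosophy"] with subjects
# ["History", "Science"] is labeled "History"; the "Science" signal is lost.
print(label_document(["History of Science", "Philosophy"], ["History", "Science"]))
# -> History
```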


>
> Miles
>
> 2009/7/22 Grant Ingersoll <gsingers@apache.org>
>
>> The model size is much smaller with unigrams.  :-)
>>
>> I'm not quite sure what constitutes good just yet, but I can report
>> the following using the commands I reported earlier, w/ the exception
>> that I am using unigrams:
>>
>> I have two categories:  History and Science
>>
>> 0. Splitter:
>> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
>> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
>> --outputDir /PATH/wikipedia/chunks -c 64
>>
>> Then prep:
>> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
>> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
>> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
>> (also do this for the training set)
>>
>> 1. Train set:
>> ls ../chunks
>> chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
>> chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
>> chunk-0033.xml  chunk-0037.xml
>> chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
>> chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
>> chunk-0034.xml  chunk-0038.xml
>> chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
>> chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
>> chunk-0035.xml  chunk-0039.xml
>> chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
>> chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
>> chunk-0036.xml
>>
>> 2. Test Set:
>> ls
>> chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
>> chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
>> chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
>> chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
>>
>> 3. Run the Trainer on the train set:
>> --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
>> --gramSize 1 --classifierType bayes
>>
>> 4. Run the TestClassifier.
>>
>> --model PATH/wikipedia/subjects/model --testDir
>> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
>>
>> Output is:
>>
>> <snip>
>> 9/07/22 15:55:09 INFO bayes.TestClassifier:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :       4143       74.0615%
>> Incorrectly Classified Instances        :       1451       25.9385%
>> Total Classified Instances              :       5594
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a       b       <--Classified as
>> 3910    186      |  4096        a     = history
>> 1265    233      |  1498        b     = science
>> Default Category: unknown: 2
>> </snip>
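For what it's worth, the summary figures above are internally consistent with the confusion matrix; a quick arithmetic sanity check:

```python
# Confusion matrix from the TestClassifier output above:
# rows are true classes, columns are predicted classes.
matrix = {
    "history": {"a": 3910, "b": 186},   # 4096 true history docs
    "science": {"a": 1265, "b": 233},   # 1498 true science docs
}

correct = matrix["history"]["a"] + matrix["science"]["b"]   # diagonal
total = sum(sum(row.values()) for row in matrix.values())

print(correct, total, round(100.0 * correct / total, 4))
# -> 4143 5594 74.0615
```

Worth noting: only 233 of 1498 science docs are classified correctly, and a majority-class baseline that always guesses history would already score 4096/5594 ≈ 73.2%, so the 74.06% is helped along considerably by the class imbalance.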
>>
>> At least it's better than 50%, which is presumably a good thing ;-)
>> I have no clue what the state of the art is these days, but it
>> doesn't seem _horrendous_ either.
>>
>> I'd love to see someone validate what I have done.  Let me know if
>> you need more details.  I'd also like to know how I can improve it.
>>
>> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>>
>>> Indeed.  I hadn't snapped to the fact you were using trigrams.
>>>
>>> 30 million features is quite plausible for that.  To effectively use
>>> long n-grams as features in classification of documents you really
>>> need to have the following:
>>>
>>> a) good statistical methods for resolving what is useful and what is
>>> not.  Everybody here knows that my preference for a first hack is
>>> sparsification with log-likelihood ratios.
>>>
>>> b) some kind of smoothing using smaller n-grams
>>>
>>> c) some kind of smoothing over variants of n-grams.
>>>
>>> AFAIK, mahout doesn't have many (or any) of these in place.  You are
>>> likely to do better with unigrams as a result.
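The log-likelihood ratio test mentioned in (a) amounts to a G² test over a 2x2 contingency table per candidate feature. A minimal Python sketch of that statistic, assuming the usual entropy-based formulation (the contingency-count naming is my convention, and this is not Mahout code):

```python
import math

def xlogx(x):
    """x * ln(x), with the conventional limit 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy term used in the G^2 statistic."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table, e.g.:
    k11 = class-A docs containing the n-gram, k12 = class-A docs without it,
    k21 = class-B docs with it,               k22 = class-B docs without it.
    High scores flag n-grams whose distribution differs across classes."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# An n-gram concentrated in one class scores high; an evenly spread one
# scores near zero, so thresholding on llr sparsifies the feature set.
print(llr(100, 900, 10, 990))   # strongly class-associated: large score
print(llr(50, 950, 50, 950))    # evenly spread: score ~ 0
```

Sparsification then keeps only features whose score clears some threshold; the choice of threshold is a tuning decision, not something the statistic dictates.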
>>>
>>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
>>> <gsingers@apache.org> wrote:
>>>
>>>> I suspect the explosion in the number of features, Ted, is due to
>>>> the use of n-grams producing a lot of unique terms.  I can try w/
>>>> gramSize = 1; that will likely reduce the feature set quite a bit.
>>>>
>>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>>
>>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in  
> Scotland,
> with registration number SC005336.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

