mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: The default category of a binary classifier
Date Wed, 19 Sep 2012 23:55:10 GMT
Shouldn't this be 'unclassified'? I think I have seen data in the unclassified buckets with
both Bayes and SGD.

----- Original Message -----
| From: "Ted Dunning" <ted.dunning@gmail.com>
| To: user@mahout.apache.org
| Sent: Wednesday, September 19, 2012 2:54:25 PM
| Subject: Re: The default category of a binary classifier
| 
| If a classifier is presented text with no words in common with the
| training data, it will give you back the most common category in the
| training data.
| 
| That said, it is likely to be quite rare when a new document consists
| *entirely* of new words.  Any overlap with trained vocabulary is likely
| to over-ride the basic frequencies of different categories.
| 
| On Wed, Sep 19, 2012 at 1:32 AM, Salman Mahmood
| <salman@influestor.com> wrote:
| 
| > First, in Mahout, is there a special way to create a binary
| > classifier? For instance, if I am creating a classifier for the
| > 20 newsgroups data, I will just pass 20 as the number of categories
| > when creating the training object:
| >
| > new AdaptiveLogisticRegression(20, FEATURES, new L1())
| >
| > Similarly, when creating a binary classifier, I will pass 2 as the
| > number of categories and that's it?
| >
| > Having established that, what is the default category for a binary
| > classifier? Let's say I was building a classifier to recognize the
| > industry sector for a news item. I have binary models to classify
| > things into technology or not technology, banking or not banking,
| > health or not health, etc. I trained the technology model with
| > technology-related news as positive and all the other news (banking
| > and health) as negative. Now if the technology model got a news item
| > to classify from the media sector (which it was not trained on), what
| > is the expected behavior? Is it gonna say it's technology news or
| > that it's not technology news? Is there any default behavior for
| > unseen/untrained news items?
| > Hope I made the question clear.
| > Thanks
| 
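
Ted's point about falling back to the most common category can be illustrated with a small sketch. This is not Mahout code: it is a hand-rolled logistic regression in plain Java, and the class name, the four-word vocabulary ("chip", "loan", "bond", "clinic"), the toy corpus, and the learning settings are all made up for illustration. It also assumes the model has an intercept (bias) term and that words outside the training vocabulary contribute nothing to the feature vector; with hashed feature encoding, unseen words may land in occupied slots and behave differently.

```java
final class DefaultCategoryDemo {
    // Logistic (sigmoid) function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // P(technology | x). w[0] is the intercept; w[i + 1] pairs with x[i].
    static double score(double[] w, double[] x) {
        double z = w[0];
        for (int i = 0; i < x.length; i++) {
            z += w[i + 1] * x[i];
        }
        return sigmoid(z);
    }

    // Plain online SGD on log loss, no regularization (an assumption;
    // Mahout's learners add priors and adaptive learning rates).
    static double[] train(double[][] xs, int[] ys, int epochs, double rate) {
        double[] w = new double[xs[0].length + 1];
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < xs.length; n++) {
                double err = ys[n] - score(w, xs[n]);
                w[0] += rate * err;  // the intercept sees every example
                for (int i = 0; i < xs[n].length; i++) {
                    w[i + 1] += rate * err * xs[n][i];
                }
            }
        }
        return w;
    }

    // Hypothetical toy corpus: columns are indicator features for the
    // words "chip", "loan", "bond", "clinic". One positive, three negatives.
    static double[] trainExample() {
        double[][] xs = {
            {1, 0, 0, 0},  // technology
            {0, 1, 0, 0},  // banking
            {0, 0, 1, 0},  // banking
            {0, 0, 0, 1},  // health
        };
        int[] ys = {1, 0, 0, 0};
        return train(xs, ys, 500, 0.1);
    }

    public static void main(String[] args) {
        double[] w = trainExample();
        double[] unseen = new double[4];  // media item: no known words
        double[] tech = {1, 0, 0, 0};     // item containing "chip"
        System.out.printf("P(tech | no known words) = %.3f%n", score(w, unseen));
        System.out.printf("P(tech | \"chip\")         = %.3f%n", score(w, tech));
    }
}
```

For the all-zero vector only the intercept contributes, and because negatives outnumber positives in this toy corpus the intercept ends up negative, so the item falls on the "not technology" side. A single known word such as "chip" is enough to outweigh that default, which matches the second paragraph of Ted's reply.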
