mahout-user mailing list archives

From: Salman Mahmood <sal...@influestor.com>
Subject: Re: The default category of a binary classifier
Date: Thu, 20 Sep 2012 15:43:17 GMT

Thanks Ted and Lance for the suggestions!
On Sep 20, 2012, at 3:05 AM, Ted Dunning wrote:

> With SGD, you can train for an unclassified category, but the system will
> always produce scores for all trained categories.  You might interpret
> these scores to decide when there is no decision, but the model itself has
> no direct concept of "unclassified".
> 
> On Wed, Sep 19, 2012 at 4:55 PM, Lance Norskog <goksron@gmail.com> wrote:
> 
>> Shouldn't this be 'unclassified'? I think I have seen data in the
>> unclassified buckets with both Bayes and SGD.
>> 
>> ----- Original Message -----
>> | From: "Ted Dunning" <ted.dunning@gmail.com>
>> | To: user@mahout.apache.org
>> | Sent: Wednesday, September 19, 2012 2:54:25 PM
>> | Subject: Re: The default category of a binary classifier
>> |
>> | If a classifier is presented with text that has no words in common with
>> | the training data, it will give you back the most common category in the
>> | training data.
>> |
>> | That said, it is likely to be quite rare for a new document to consist
>> | *entirely* of new words.  Any overlap with the trained vocabulary is
>> | likely to override the basic frequencies of the different categories.
>> |
>> | On Wed, Sep 19, 2012 at 1:32 AM, Salman Mahmood <salman@influestor.com> wrote:
>> |
>> | > First, in Mahout, is there a special way to create a binary
>> | > classifier? For instance, if I am creating a classifier for the
>> | > 20 newsgroups data, I will just pass 20 as the number of categories
>> | > when creating the training object:
>> | >
>> | > new AdaptiveLogisticRegression(20, FEATURES, new L1())
>> | >
>> | > Similarly, when creating a binary classifier, I will pass 2 as the
>> | > number of categories and that's it?
>> | >
>> | > Having established that, what is the default category for a binary
>> | > classifier? Let's say I am building a classifier to recognize the
>> | > industry sector of a news item. I have binary models to classify
>> | > items as technology or not technology, banking or not banking,
>> | > health or not health, etc. I trained the technology model with
>> | > technology-related news as positive and all the other news (banking
>> | > and health) as negative. Now, if the technology model gets a news
>> | > item to classify from the media sector (which it was not trained
>> | > on), what is the expected behavior? Is it going to say it is
>> | > technology news or that it is not technology news? Is there any
>> | > default behavior for unseen/untrained news items?
>> | > Hope I made the question clear.
>> | > Thanks
>> |
>> 
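
Ted's point above, that the model always produces a score for every trained category and that it is up to the caller to read "no decision" out of those scores, could be sketched roughly as follows. This is only an illustration in plain Java, not Mahout code: the class ScoreThreshold, the method decide, the NO_DECISION constant, and the 0.7 cutoff are all hypothetical names and values chosen for the example. It assumes you already have a per-category score array from whatever classifier you trained.

```java
// Hypothetical helper, not part of Mahout: turns per-category scores
// into either a category index or an explicit "no decision" marker.
final class ScoreThreshold {
    // Assumed sentinel for "the scores are too weak to pick a category".
    static final int NO_DECISION = -1;

    // Returns the index of the highest score, or NO_DECISION when that
    // score is below minScore (or when there are no scores at all).
    static int decide(double[] scores, double minScore) {
        int best = NO_DECISION;
        for (int i = 0; i < scores.length; i++) {
            if (best == NO_DECISION || scores[i] > scores[best]) {
                best = i;
            }
        }
        if (best == NO_DECISION || scores[best] < minScore) {
            return NO_DECISION;
        }
        return best;
    }

    public static void main(String[] args) {
        // Index 0 = "technology", index 1 = "not technology" (assumed layout).
        double[] confident = {0.93, 0.07};
        double[] unsure = {0.55, 0.45};
        System.out.println(decide(confident, 0.7)); // prints 0
        System.out.println(decide(unsure, 0.7));    // prints -1
    }
}
```

The cutoff is an arbitrary value for the sketch; in practice it would have to be tuned on held-out data, and a document with almost no known words may still score near the training-set class frequencies rather than near 0.5.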

