mahout-user mailing list archives

From Josh Patterson <j...@cloudera.com>
Subject Re: Ratio between positive and negative data in a classification model
Date Tue, 02 Oct 2012 15:16:22 GMT
This may also be relevant:

"Logistic Regression in Rare Events Data"

http://gking.harvard.edu/gking/files/abs/0s-abs.shtml
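The core idea in that paper is "prior correction": fit a logistic model on a resampled set with an inflated event rate, then shift the intercept so predicted probabilities match the true population rate. A minimal sketch of that adjustment (the formula is my reading of King & Zeng's prior correction; the function names are mine, so verify against the paper before relying on it):

```python
import math

def corrected_intercept(b0, tau, ybar):
    """Adjust the intercept of a logistic model that was fit on a
    sample with event fraction `ybar`, so that its probabilities
    match a population whose true event fraction is `tau`.
    (Prior-correction formula assumed from King & Zeng.)"""
    return b0 - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A model fit on a 50/50 resampled set (ybar = 0.5) when the true
# event rate is 1% (tau = 0.01): the corrected intercept pulls the
# baseline predicted probability from 0.5 back down toward 0.01.
b0 = 0.0
b0c = corrected_intercept(b0, tau=0.01, ybar=0.5)
print(sigmoid(b0), sigmoid(b0c))
```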

JP

On Tue, Oct 2, 2012 at 7:09 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Having lots of negative samples won't improve performance that much
> (shouldn't hurt much either).
>
> The negative examples that you really want are the ones that are close to
> your positive examples.
>
> On Mon, Oct 1, 2012 at 10:54 AM, Salman Mahmood <salman@influestor.com> wrote:
>
>> I am building a binary classifier. Let's assume the classifier decides whether a
>> particular news item is about Apache or not. I have 200 positive
>> examples/news items about Apache.
>> I am a bit confused about the negative examples, because there could be a
>> huge number of them. What strategy should I use when
>> preparing the negative data?
>> With 200 positive examples, does it make sense to train the classifier
>> with 5000 negative examples drawn from all other sectors of news
>> (finance, health, sports, misc, travel, etc.), or should the gap between the
>> positive and negative data not be in the thousands? In that case I
>> am afraid the classifier will not be properly trained.
>>
>>
>>
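Ted's point about wanting negatives close to the positives is often called hard-negative mining: rather than sampling negatives uniformly, keep the ones the model is most likely to confuse with the positive class. A toy sketch of that selection, assuming examples are already feature vectors (the function and data here are illustrative, not Mahout API):

```python
import math

def hard_negatives(positives, negatives, k):
    """Rank negative examples by distance to their nearest positive
    and keep the k closest -- the negatives most easily confused
    with the positive class."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scored = sorted(negatives,
                    key=lambda n: min(dist(n, p) for p in positives))
    return scored[:k]

# Toy 2-D feature vectors: positives cluster near (1, 1), so the
# negatives that also sit near (1, 1) are the informative ones.
pos = [(1.0, 1.0), (1.2, 0.9)]
neg = [(1.1, 1.1), (5.0, 5.0), (0.9, 1.3), (8.0, 0.0)]

print(hard_negatives(pos, neg, 2))  # → [(1.1, 1.1), (0.9, 1.3)]
```

In practice the "distance" would be the current model's own score (pick the negatives it rates most positive, retrain, repeat), but the nearest-neighbour version above conveys the idea.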



-- 
Twitter: @jpatanooga
Principal Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
