mahout-user mailing list archives

From Josh Patterson <>
Subject Re: Ratio between positive and negative data in a classification model
Date Tue, 02 Oct 2012 15:16:22 GMT
This may also be relevant:

"Logistic Regression in Rare Events Data"


On Tue, Oct 2, 2012 at 7:09 AM, Ted Dunning <> wrote:
> Having lots of negative samples won't improve performance that much
> (shouldn't hurt much either).
> The negative examples that you really want are the ones that are close to
> your positive examples.
> On Mon, Oct 1, 2012 at 10:54 AM, Salman Mahmood <> wrote:
>> I am building a binary classifier. Let's assume the classifier decides whether a
>> particular news item is about Apache or not. I have 200 positive
>> examples/news items about Apache.
>> I am a bit confused about the negative examples, because there could be a
>> huge number of them. What strategy should I use when
>> preparing the negative data?
>> With 200 positive examples, does it make sense to train the classifier
>> with 5000 negative examples drawn from all other sectors of news
>> (finance, health, sports, misc, travel, etc.)? Or should the gap between the
>> positive and the negative data not be in the thousands? In that case I
>> am afraid the classifier will not be properly trained.
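Ted's point above, that the most useful negatives are the ones close to your positives, can be sketched as a simple hard-negative selection step: rank candidate negatives by distance to the nearest positive and keep the closest ones. This is an illustrative sketch on synthetic data, not the method anyone in the thread prescribes:

```python
# Sketch: pick "hard" negatives, i.e. candidate negatives nearest to
# any positive example. Data and sizes are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
pos = rng.normal(loc=1.0, size=(200, 2))            # positive examples
candidates = rng.normal(loc=-1.0, size=(5000, 2))   # candidate negatives

# Distance from each candidate negative to its nearest positive.
nn = NearestNeighbors(n_neighbors=1).fit(pos)
dist, _ = nn.kneighbors(candidates)

# Keep the 1000 candidates closest to the positive class.
hard_idx = np.argsort(dist.ravel())[:1000]
hard_negatives = candidates[hard_idx]
```

The retained negatives sit near the decision boundary, which is where they carry the most information for the classifier.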

Twitter: @jpatanooga
Principal Solution Architect @ Cloudera
