mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: SGD different confusion matrix for each run
Date Sat, 01 Sep 2012 01:08:48 GMT
"Try passing through the data 100 times for a start. "

And randomize the order each time?
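
For concreteness, a minimal sketch of that loop (many shuffled passes over a small training set), assuming Mahout's OnlineLogisticRegression API; the TrainingExample holder and the learning-rate/lambda values are illustrative stand-ins, not recommendations:

import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class ManyPassesSketch {

  // Hypothetical holder for one training example: a label (0 or 1) and an
  // already-encoded feature vector.
  public interface TrainingExample {
    int label();
    Vector features();
  }

  public static OnlineLogisticRegression train(List<TrainingExample> examples,
                                               int numFeatures) {
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, numFeatures, new L1())
            .learningRate(1)      // illustrative values; tune for your data
            .lambda(1e-4);

    // Pass through the tiny data set many times, reshuffling before each
    // pass so SGD never sees the examples in a fixed order.
    Random random = new Random();
    for (int pass = 0; pass < 100; pass++) {
      Collections.shuffle(examples, random);
      for (TrainingExample example : examples) {
        learner.train(example.label(), example.features());
      }
    }
    return learner;
  }
}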

On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <salman@influestor.com> wrote:
> Cheers ted. Appreciate the input!
>
> Sent from my iPhone
>
> On 31 Aug 2012, at 17:53, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> OK.
>>
>> Try passing through the data 100 times for a start.  I think that this is
>> likely to fix your problems.
>>
>> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
>> may converge faster than it should.
>>
>> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com> wrote:
>>
>>> Thanks a lot ted. Here are the answers:
>>> d) Data (news articles from different feeds)
>>>        News Article 1:
>>>            Title: BP Profits Plunge On Massive Asset Write-down
>>>            Description: BP PLC (BP) Tuesday posted a dramatic fall of 96%
>>> in adjusted profit for the second quarter as it wrote down the value of its
>>> assets by $5 billion, including some U.S. refineries, a suspended Alaskan
>>> oil project and U.S. shale gas resources.
>>>
>>>        News Article 2:
>>>            Title: Morgan Stanley Missed Big: Why It's Still A Fantastic Short
>>>            Description: By Mike Williams: Though the market responded very
>>> positively to Citigroup (C) and Bank of America's (BAC) reserve
>>> release-driven earnings "beats", last week's Morgan Stanley (MS) earnings
>>> report illustrated what happens when a bank doesn't have billions of
>>> reserves to release back into earnings. Estimates called for the following:
>>> $.43 per share in earnings, $.29 per share in earnings ex-DVA (debt value
>>> adjustment), and $7.7 billion in revenue. GAAP results (including the DVA)
>>> came in at $.28 per share, while ex-DVA earnings were $.16. Revenue was a
>>> particular disappointment, coming in at $6.95 billion.
>>>
>>> c) As you can see, the data is textual. I am using the title and
>>> description as predictor variables, and the target variable is the company
>>> name the news article belongs to.
>>>
>>> b) I am passing through the data once (at least that is what I think). I
>>> followed the 20newsgroup example code (in Java) and didn't find that the
>>> data was passed more than once.
>>> Yes, I randomize the order every time.
>>>
>>> a) I am using AdaptiveLogisticRegression (just like the 20newsgroup example).
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>>
>>>> First, this is a tiny training set.  You are well outside the intended
>>>> application range so you are likely to find less experience in the
>>>> community in that range.  That said, the algorithm should still produce
>>>> reasonably stable results.
>>>>
>>>> Here are a few questions:
>>>>
>>>> a) which class are you using to train your model?  I would start with
>>>> OnlineLogisticRegression and experiment with training rate schedules and
>>>> amount of regularization to find out how to build a good model.
>>>>
>>>> b) how many times are you passing through your data?  Do you randomize
>>>> the order each time?  These are critical to proper training.  Instead of
>>>> randomizing order, you could just sample a data point at random and not
>>>> worry about using a complete permutation of the data.  With such a tiny
>>>> data set, you will need to pass through the data many times ... possibly
>>>> hundreds of times or more.
>>>>
>>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
>>>> What kind?
>>>>
>>>> d) can you post your data?
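
A minimal sketch of the random-sampling idea from question (b) above, assuming Mahout's OnlineLogisticRegression API; the learning-rate schedule and regularization values shown are illustrative starting points only:

import java.util.List;
import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class RandomSamplingSketch {

  public static OnlineLogisticRegression train(List<Vector> features,
                                               List<Integer> labels,
                                               int numFeatures,
                                               int steps) {
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, numFeatures, new L1())
            .learningRate(0.5)     // initial rate; experiment with schedules
            .stepOffset(100)       // slows the learning-rate decay early on
            .decayExponent(0.9)
            .lambda(1e-4);         // amount of regularization

    // Rather than permuting the whole data set on every pass, draw one
    // training example uniformly at random for each SGD step.
    Random random = new Random();
    for (int i = 0; i < steps; i++) {
      int k = random.nextInt(features.size());
      learner.train(labels.get(k), features.get(k));
    }
    return learner;
  }
}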
>>>>
>>>>
>>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com>
>>>> wrote:
>>>>
>>>>> Thanks a lot Lance. Let me elaborate on the problem in case it was a bit
>>>>> confusing.
>>>>>
>>>>> Assume I am building a binary classifier using SGD. I have 50 positive
>>>>> and 50 negative examples to train the classifier. After training and
>>>>> testing the model, the confusion matrix tells you the number of correctly
>>>>> and incorrectly classified instances. Let's assume I got 85% correct and
>>>>> 15% incorrect instances.
>>>>>
>>>>> Now if I run my program again using the same 50 negative and 50 positive
>>>>> examples, then to my knowledge the classifier should yield the same
>>>>> results as before (because not a single training or testing example was
>>>>> changed), but this is not the case. I get different results for different
>>>>> runs. The confusion matrix figures change each time I generate a model,
>>>>> keeping the data constant. What I do is generate a model several times
>>>>> and keep an eye on the accuracy, and if it is above 90%, I stop running
>>>>> the code, and hence an accurate model is created.
>>>>>
>>>>> So what you are saying is to shuffle my data before I use it for
>>>>> training and testing?
>>>>> Thanks!
>>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>>
>>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>>> permute the order of your data.
>>>>>>
>>>>>> If that does not help, another trick: for each data point, randomly
>>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>>> permute the entire input set.
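
A minimal sketch of that "generate nearby points" trick, assuming dense Mahout vectors; the noise scale and number of copies are arbitrary illustrative choices, and labels are omitted for brevity:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class JitterSketch {

  private static final Random RANDOM = new Random();

  // Make `copies` new points close to the original by adding small Gaussian
  // noise to every component.
  static List<Vector> nearbyPoints(Vector original, int copies, double scale) {
    List<Vector> out = new ArrayList<Vector>();
    for (int i = 0; i < copies; i++) {
      Vector v = new DenseVector(original);
      for (int j = 0; j < v.size(); j++) {
        v.setQuick(j, v.getQuick(j) + scale * RANDOM.nextGaussian());
      }
      out.add(v);
    }
    return out;
  }

  // Expand the training set with the nearby points, then permute the whole
  // thing before feeding it to SGD.
  static List<Vector> expandAndShuffle(List<Vector> data, int copiesPerPoint) {
    List<Vector> expanded = new ArrayList<Vector>(data);
    for (Vector v : data) {
      expanded.addAll(nearbyPoints(v, copiesPerPoint, 0.01));
    }
    Collections.shuffle(expanded, RANDOM);
    return expanded;
  }
}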
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <goksron@gmail.com>
>>>>>> wrote:
>>>>>>> The more data you have, the closer each run will be. How much data do
>>>>>>> you have?
>>>>>>>
>>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <salman@influestor.com>
>>>>>>> wrote:
>>>>>>>> I have noticed that every time I train and test a model using the
>>>>>>>> same data (in the SGD algo), I get a different confusion matrix.
>>>>>>>> Meaning, if I generate a model and look at the confusion matrix, it
>>>>>>>> might say 90% correctly classified instances, but if I generate the
>>>>>>>> model again (with the SAME data for training and testing as before)
>>>>>>>> and test it, the confusion matrix changes and it might say 75%
>>>>>>>> correctly classified instances.
>>>>>>>>
>>>>>>>> Is this a desired behavior?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>
>>>>>
>>>
>>>



-- 
Lance Norskog
goksron@gmail.com
