mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SGD different confusion matrix for each run
Date Sat, 01 Sep 2012 03:34:58 GMT
Frankly, you get a better approximation of the underlying distribution of
samples if you sample *with* replacement.  This means just pick a uniform
sample from the training data each time and limit by the number of samples,
not the number of passes through the data.

SGD is sample-centric: it depends on you taking a random sample from the
underlying distribution of training data.  Convergence is stated in terms
of the number of samples, and the closer you can come to sampling from the
real distribution, the better the process will approximate the
mathematical idea.  When you have a fixed and finite set of training data
instead of something that samples from the real distribution, you have to
approximate the underlying distribution using the bootstrap [1], and that
is best done by sampling with replacement rather than by repeated
sampling without replacement.

[1] http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
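
For illustration only, here is a minimal sketch of that sampling scheme. It is written in plain Python rather than against Mahout's Java classes, and nothing in it comes from the thread's actual code: the names train_sgd, predict and make_toy_data, the two-feature Gaussian toy data, the learning rate and the seeds are all assumptions made for the example. Each update draws one example uniformly with replacement, and training is limited by the number of samples drawn rather than by passes over the data.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(examples, num_samples, learning_rate=0.1, seed=42):
    """Logistic regression by SGD (hypothetical helper, not a Mahout API).

    Each step draws one (features, label) pair uniformly WITH replacement;
    the budget is a number of samples, not a number of passes.
    """
    rng = random.Random(seed)  # fixed seed => repeatable runs
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(num_samples):
        features, label = rng.choice(examples)  # bootstrap-style draw
        p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
        error = label - p  # gradient of the log-likelihood w.r.t. the score
        bias += learning_rate * error
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, features)]
    return weights, bias


def predict(model, features):
    weights, bias = model
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


def make_toy_data(n_per_class=50, seed=0):
    # Assumed stand-in for a tiny data set: 50 positive, 50 negative points.
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append(([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)], 1))
        data.append(([rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)], 0))
    return data


data = make_toy_data()
model = train_sgd(data, num_samples=10_000)
accuracy = sum(predict(model, x) == y for x, y in data) / len(data)
print("training accuracy:", accuracy)
```

In this sketch the only source of run-to-run variation is the random draw, so fixing the seed makes two training runs produce the same model. Whether the same holds for a given Mahout learner depends on how that class handles randomness, which this example does not model.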

On Fri, Aug 31, 2012 at 11:24 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> That would be best, but practically speaking, randomizing once is usually
> OK.  With a tiny data set like this that is in memory anyway, I wouldn't
> take any chances.
>
>
> On Fri, Aug 31, 2012 at 9:08 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> "Try passing through the data 100 times for a start. "
>>
>> And randomize the order each time?
>>
>> On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <salman@influestor.com>
>> wrote:
>> > Cheers ted. Appreciate the input!
>> >
>> > On 31 Aug 2012, at 17:53, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >
>> >> OK.
>> >>
>> >> Try passing through the data 100 times for a start.  I think that this is
>> >> likely to fix your problems.
>> >>
>> >> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
>> >> may converge faster than it should.
>> >>
>> >> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com>
>> >> wrote:
>> >>
>> >>> Thanks a lot ted. Here are the answers:
>> >>> d) Data (news articles from different feeds)
>> >>>        News Article 1:
>> >>>          Title: BP Profits Plunge On Massive Asset Write-down
>> >>>          Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
>> >>>          adjusted profit for the second quarter as it wrote down the value
>> >>>          of its assets by $5 billion including some U.S. refineries a
>> >>>          suspended Alaskan oil project and U.S. shale gas resources
>> >>>
>> >>>        News Article 2:
>> >>>          Title: Morgan Stanley Missed Big
>> >>>          Description: Why It's Still A Fantastic Short,"By Mike Williams:
>> >>>          Though the market responded very positively to Citigroup (C) and
>> >>>          Bank of America's (BAC) reserve release-driven earnings ""beats""
>> >>>          last week's Morgan Stanley (MS) earnings report illustrated what
>> >>>          happens when a bank doesn't have billions of reserves to release
>> >>>          back into earnings. Estimates called for the following: $.43 per
>> >>>          share in earnings $.29 per share in earnings ex-DVA (debt value
>> >>>          adjustment) $7.7 billion in revenue GAAP results (including the
>> >>>          DVA) came in at $.28 per share while ex-DVA earnings were $.16.
>> >>>          Revenue was a particular disappointment coming in at $6.95 billion.
>> >>>
>> >>> c) As you can see, the data is textual. I am using the title and
>> >>> description as predictor variables, and the target variable is the
>> >>> company name a news article belongs to.
>> >>>
>> >>> b) I am passing through the data once (at least this is what I think). I
>> >>> followed the 20newsgroup example code (in Java) and didn't find that the
>> >>> data was passed through more than once.
>> >>> Yes, I randomize the order every time.
>> >>>
>> >>> a) I am using AdaptiveLogisticRegression (just like 20newsgroup).
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>>
>> >>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>> >>>
>> >>>> First, this is a tiny training set.  You are well outside the intended
>> >>>> application range so you are likely to find less experience in the
>> >>>> community in that range.  That said, the algorithm should still produce
>> >>>> reasonably stable results.
>> >>>>
>> >>>> Here are a few questions:
>> >>>>
>> >>>> a) which class are you using to train your model?  I would start with
>> >>>> OnlineLogisticRegression and experiment with training rate schedules and
>> >>>> amount of regularization to find out how to build a good model.
>> >>>>
>> >>>> b) how many times are you passing through your data?  Do you randomize the
>> >>>> order each time?  These are critical to proper training.  Instead of
>> >>>> randomizing order, you could just sample a data point at random and not
>> >>>> worry about using a complete permutation of the data.  With such a tiny
>> >>>> data set, you will need to pass through the data many times ... possibly
>> >>>> hundreds of times or more.
>> >>>>
>> >>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
>> >>>> What kind?
>> >>>>
>> >>>> d) can you post your data?
>> >>>>
>> >>>>
>> >>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Thanks a lot, Lance. Let me elaborate on the problem in case it was a
>> >>>>> bit confusing.
>> >>>>>
>> >>>>> Assume I am building a binary classifier using SGD. I have 50 positive
>> >>>>> and 50 negative examples to train the classifier. After training and
>> >>>>> testing the model, the confusion matrix tells you the number of
>> >>>>> correctly and incorrectly classified instances. Let's assume I got 85%
>> >>>>> correct and 15% incorrect instances.
>> >>>>>
>> >>>>> Now if I run my program again using the same 50 negative and 50
>> >>>>> positive examples, then to my knowledge the classifier should yield the
>> >>>>> same results as before (because not a single training or testing
>> >>>>> example was changed), but this is not the case. I get different results
>> >>>>> for different runs. The confusion matrix figures change each time I
>> >>>>> generate a model while keeping the data constant. What I do is generate
>> >>>>> a model several times and watch the accuracy, and if it is above 90%,
>> >>>>> I stop running the code and hence an accurate model is created.
>> >>>>>
>> >>>>> So what you are saying is that I should shuffle my data before I use it
>> >>>>> for training and testing?
>> >>>>> Thanks!
>> >>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>> >>>>>
>> >>>>>> Now I remember: SGD wants its data input in random order. You need to
>> >>>>>> permute the order of your data.
>> >>>>>>
>> >>>>>> If that does not help, another trick: for each data point, randomly
>> >>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>> >>>>>> permute the entire input set.
>> >>>>>>
>> >>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <goksron@gmail.com>
>> >>>>>> wrote:
>> >>>>>>> The more data you have, the closer each run will be. How much data do
>> >>>>>>> you have?
>> >>>>>>>
>> >>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <salman@influestor.com>
>> >>>>>>> wrote:
>> >>>>>>>> I have noticed that every time I train and test a model using the
>> >>>>>>>> same data (with the SGD algorithm), I get a different confusion
>> >>>>>>>> matrix. Meaning, if I generate a model and look at the confusion
>> >>>>>>>> matrix, it might say 90% correctly classified instances, but if I
>> >>>>>>>> generate the model again (with the SAME data for training and testing
>> >>>>>>>> as before) and test it, the confusion matrix changes and it might say
>> >>>>>>>> 75% correctly classified instances.
>> >>>>>>>>
>> >>>>>>>> Is this a desired behavior?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Lance Norskog
>> >>>>>>> goksron@gmail.com
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Lance Norskog
>> >>>>>> goksron@gmail.com
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>
