mahout-user mailing list archives

From Andreas Bauer <b...@gmx.net>
Subject Re: OnlineLogisticRegression: Are my settings sensible
Date Fri, 08 Nov 2013 11:15:00 GMT
Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale machine learning,
but I guess it shouldn't have problems with such small data either.



Ted Dunning <ted.dunning@gmail.com> wrote:
>On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <buki@gmx.net> wrote:
>
>> Hi,
>>
>> Thanks for your comments.
>>
>> I modified the examples from the Mahout in Action book, therefore I used
>> the hashed approach and that's why I used 100 features. I'll adjust the
>> number.
>>
>
>Makes sense.  But the book was doing sparse features.
>
>
>
>> You say that I'm using the same CVE for all features, so you mean I
>> should create 12 separate CVEs for adding features to the vector like
>> this?
>>
>
>Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
>pattern is generated from the name of the variable.  For a word encoder,
>the hashing pattern is generated by the name of the variable (specified at
>construction of the encoder) and the word itself (specified at encode
>time).  Text is just repeated words, except that the weights aren't
>necessarily linear in the number of times a word appears.
>
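(In code I take that to mean something like the sketch below: one
ContinuousValueEncoder per variable name, all hashing into the same vector.
The feature names and readSample() are invented stand-ins, and the encoder
behavior is as I understand Mahout's vectorizer encoders:)

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;

    // One encoder per variable: each name gives its own hashing pattern.
    String[] names = {"age", "pulse", "temp" /* ... one name per feature, 12 total */};
    ContinuousValueEncoder[] encoders = new ContinuousValueEncoder[names.length];
    for (int i = 0; i < names.length; i++) {
        encoders[i] = new ContinuousValueEncoder(names[i]);
    }

    double[] sample = readSample();               // hypothetical: raw values, same order as names
    Vector v = new RandomAccessSparseVector(100); // hashed feature space
    for (int i = 0; i < names.length; i++) {
        // For continuous values the string form may be null; the value goes in as the weight.
        encoders[i].addToVector((String) null, sample[i], v);
    }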
>In your case, you could have used a goofy trick with a word encoder where
>the "word" is the variable name and the value of the variable is passed as
>the weight of the word.
>
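(If I follow, that trick would look roughly like this; a sketch assuming
StaticWordValueEncoder works as in the book's examples, with invented
variable/value pairs:)

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // A single word encoder; each variable name is passed as the "word", so
    // each name hashes to its own locations, and the raw value rides along
    // as the weight.
    StaticWordValueEncoder enc = new StaticWordValueEncoder("features");
    Vector v = new RandomAccessSparseVector(100);
    enc.addToVector("age", 37.0, v);    // hypothetical variable/value pairs
    enc.addToVector("pulse", 62.5, v);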
>But all of this hashing is really just extra work for you.  Easier to just
>pack your data into a dense vector.
>
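(Which would reduce to something like this; readSample() is again a
stand-in for wherever the 12 raw values come from:)

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // With only 12 known features there is nothing to hash: index i is feature i.
    double[] raw = readSample();     // hypothetical: the 12 raw values in a fixed order
    Vector v = new DenseVector(raw); // 12-dimensional, no collisions, no extra work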
>
>> Finally, I thought online logistic regression meant that it is an online
>> algorithm, so it's fine to train only once. Does it mean I should invoke
>> the train method over and over again with the same training sample until
>> the next one arrives, or how should I make the model converge (or at
>> least try to with the few samples)?
>>
>
>What online really implies is that training data is measured in terms of
>number of input records instead of in terms of passes through the data.  To
>converge, you have to see enough data.  If that means you need to pass
>through the data several times to fool the learner ... well, it means you
>have to pass through the data several times.
>
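(So for a small sample set an outer epoch loop is legitimate. A sketch,
with the OnlineLogisticRegression hyperparameters borrowed loosely from the
Mahout in Action examples and loadSamples()/loadLabels() invented:)

    import java.util.List;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // 2 categories, 12 dense features, L1 prior.
    OnlineLogisticRegression lr = new OnlineLogisticRegression(2, 12, new L1())
        .lambda(1e-4)
        .learningRate(1);

    List<Vector> samples = loadSamples();  // hypothetical
    List<Integer> labels = loadLabels();   // hypothetical: 0 or 1 per sample
    // "Online" bounds the cost per record; it does not forbid replaying the
    // records. Several passes over a small set is how you feed it enough data.
    for (int epoch = 0; epoch < 50; epoch++) {
        for (int i = 0; i < samples.size(); i++) {
            lr.train(labels.get(i), samples.get(i));
        }
    }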
>Some online learners are exact in that they always have the exact result at
>hand for all the data they have seen.  Welford's algorithm for computing
>sample mean and variance is like that.  Others approximate an answer.  Most
>systems which are estimating some property of a distribution are
>necessarily approximate.  In fact, even Welford's method for means is
>really only approximating the mean of the distribution based on what it has
>seen so far.  It happens that it gives you the best possible estimate so
>far, but that is just because computing a mean is simple enough.  With
>regularized logistic regression, the estimation is trickier, and you can
>only say that the algorithm will converge to the correct result eventually
>rather than say that the answer is always as good as it can be.
>
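(Welford's update, for reference, is tiny. This is the standard algorithm,
not Mahout code:)

    // Welford's online algorithm: exact running mean and (sample) variance,
    // constant time and memory per observation.
    public final class Welford {
        private long n;
        private double mean;
        private double m2;  // sum of squared deviations from the running mean

        public void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);  // uses the updated mean
        }

        public double mean() { return mean; }
        public double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }
    }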
>Another way to say it is that the key property of on-line learning is that
>the learning takes a fixed amount of time and no additional memory for each
>input example.
>
>
>> What would you suggest to use for incremental training instead of OLR?
>> Is Mahout perhaps the wrong library?
>
>Well, for thousands of examples, anything at all will work quite well, even
>R.  Just keep all the data around and fit the data whenever requested.
>
>Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
>learner.  A quick experiment indicates that it will handle 200K samples of
>the sort you are looking at in about a second, with multiple levels of
>lambda thrown into the bargain.  Versions available in R, Matlab and
>Fortran (at least).
>
>http://www-stat.stanford.edu/~tibs/glmnet-matlab/
>
>This kind of in-memory, single-machine problem is just not what Mahout is
>intended to solve.

