mahout-user mailing list archives

From Andreas Bauer <>
Subject Re: OnlineLogisticRegression: Are my settings sensible
Date Fri, 08 Nov 2013 11:15:00 GMT
Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale machine learning,
but I guess it shouldn't have problems with such small data either.

Ted Dunning <> wrote:
>On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <> wrote:
>> Hi,
>> Thanks for your comments.
>> I modified the examples from the Mahout in Action book; therefore I used
>> the hashed approach, and that's why I used 100 features. I'll adjust the
>> number.
>Makes sense.  But the book was doing sparse features.
>> You say that I'm using the same CVE for all features, so you mean I
>> should create 12 separate CVEs for adding the features to the vector?
>Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
>pattern is generated from the name of the variable.  For a word encoder,
>the hashing pattern is generated by the name of the variable (specified at
>construction of the encoder) and the word itself (specified at encode
>time).  Text is just repeated words, except that the weights aren't
>necessarily linear in the number of times a word appears.
>In your case, you could have used a goofy trick with a word encoder where
>the "word" is the variable name and the value of the variable is passed as
>the weight of the word.
>But all of this hashing is really just extra work for you.  Easier to
>pack your data into a dense vector.
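(A side note on the hashing point above: Mahout's encoders actually hash the encoder name with MurmurHash and multiple probes, but the effect can be illustrated with plain `String.hashCode`. The toy sketch below, with invented class and feature names, shows why one shared encoder name collapses 12 features into a single vector slot while per-feature names spread them out.)

```java
import java.util.HashSet;
import java.util.Set;

public class HashedFeatureSketch {
    static final int CARDINALITY = 100; // vector size, as in the original example

    // Toy stand-in for a hashed encoder: the slot depends only on the encoder's name.
    // (Mahout's real encoders use MurmurHash with several probes; this is just the idea.)
    static int slotFor(String encoderName) {
        return Math.floorMod(encoderName.hashCode(), CARDINALITY);
    }

    // Count distinct slots hit by 12 features, either sharing one encoder
    // name or using one name per feature.
    static int distinctSlots(boolean separateNames) {
        Set<Integer> slots = new HashSet<>();
        for (int i = 0; i < 12; i++) {
            slots.add(slotFor(separateNames ? "f" + i : "feature"));
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println("one shared encoder name -> " + distinctSlots(false) + " slot(s)");  // 1
        System.out.println("one name per feature    -> " + distinctSlots(true) + " slot(s)");
    }
}
```

With a single shared name, all 12 values pile into the same slot and the model sees only their sum, which is exactly the problem Ted describes.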
>> Finally, I thought online logistic regression meant that it is an online
>> algorithm, so it's fine to train only once. Does it mean I should call
>> the train method over and over again with the same training sample until
>> the next one arrives, or how should I make the model converge (or at
>> least try to with the few samples)?
>What online really implies is that training data is measured in terms of the
>number of input records instead of in terms of passes through the data.  To
>converge, you have to see enough data.  If that means you need to pass
>through the data several times to fool the learner ... well, it means you
>have to pass through the data several times.
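(A toy illustration of that point: the sketch below is a bare-bones SGD logistic learner, not Mahout's OnlineLogisticRegression, and every name in it is invented. Each `train` call does a fixed amount of work on one record; with only a handful of samples, looping the same records through many passes is what makes it converge.)

```java
public class ToyOnlineLogistic {
    private final double[] w;  // weights; last slot is the bias
    private final double rate; // learning rate

    ToyOnlineLogistic(int numFeatures, double learningRate) {
        this.w = new double[numFeatures + 1];
        this.rate = learningRate;
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    double predict(double[] x) {
        double z = w[w.length - 1]; // bias term
        for (int i = 0; i < x.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One online update: fixed work, no extra memory per example.
    void train(int label, double[] x) {
        double err = label - predict(x);
        for (int i = 0; i < x.length; i++) w[i] += rate * err * x[i];
        w[w.length - 1] += rate * err;
    }

    public static void main(String[] args) {
        double[][] xs = {{-2}, {-1}, {1}, {2}};
        int[] ys = {0, 0, 1, 1};
        ToyOnlineLogistic model = new ToyOnlineLogistic(1, 0.5);
        // Few samples, so pass over the same records many times to converge.
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) model.train(ys[i], xs[i]);
        }
        System.out.println("p(y=1 | x=-2) = " + model.predict(new double[]{-2}));
        System.out.println("p(y=1 | x= 2) = " + model.predict(new double[]{2}));
    }
}
```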
>Some online learners are exact in that they always have the exact result at
>hand for all the data they have seen.  Welford's algorithm for computing the
>sample mean and variance is like that.  Others approximate an answer.  Most
>systems which are estimating some property of a distribution are
>necessarily approximate.  In fact, even Welford's method for means is
>really only approximating the mean of the distribution based on what it has
>seen so far.  It happens that it gives you the best possible estimate so
>far, but that is just because computing a mean is simple enough.  With
>regularized logistic regression, the estimation is trickier and you can
>only say that the algorithm will converge to the correct result eventually
>rather than say that the answer is always as good as it can be.
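(For reference, Welford's algorithm mentioned above fits in a few lines. The class below is a hypothetical sketch: fixed work per observation, and the running mean and variance are exact for whatever data has been seen so far.)

```java
public class Welford {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the current mean

    // One observation: constant time, no stored history.
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    double mean() { return mean; }

    double sampleVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    public static void main(String[] args) {
        Welford w = new Welford();
        for (double x : new double[]{1, 2, 3, 4, 5}) w.add(x);
        System.out.println("mean = " + w.mean());               // 3.0
        System.out.println("variance = " + w.sampleVariance()); // 2.5
    }
}
```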
>Another way to say it is that the key property of on-line learning is that
>the learning takes a fixed amount of time and no additional memory per
>input example.
>> What would you suggest to use for incremental training instead of OLR?
>> Is Mahout perhaps the wrong library?
>Well, for thousands of examples, anything at all will work quite well, even
>R.  Just keep all the data around and fit the data whenever requested.
>Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
>learner.  A quick experiment indicates that it will handle 200K samples of
>the sort you are looking at in about a second, with multiple levels of
>regularization thrown into the bargain.  Versions are available in R, Matlab
>and Fortran.  This kind of in-memory, single-machine problem is just not
>what Mahout is intended to solve.
