Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale machine learning,
but I guess it shouldn't have problems with such small data either.
Ted Dunning <ted.dunning@gmail.com> wrote:
>On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <buki@gmx.net> wrote:
>
>> Hi,
>>
>> Thanks for your comments.
>>
>> I modified the examples from the Mahout in Action book, so I used
>> the hashed approach, and that's why I used 100 features. I'll adjust
>> the number.
>>
>
>Makes sense. But the book was doing sparse features.
>
>
>
>> You say that I'm using the same CVE for all features, so you mean I
>> should create 12 separate CVEs for adding features to the vector like
>> this?
>>
>
>Yes. Otherwise you don't get different hashes. With a CVE, the hashing
>pattern is generated from the name of the variable. For a word encoder,
>the hashing pattern is generated from the name of the variable (specified
>at construction of the encoder) and the word itself (specified at encode
>time). Text is just repeated words, except that the weights aren't
>necessarily linear in the number of times a word appears.
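>
>For example, something like this (a rough sketch; I'm writing the encoder
>calls from memory, so double-check the exact class names and signatures):
>
>  import org.apache.mahout.math.RandomAccessSparseVector;
>  import org.apache.mahout.math.Vector;
>  import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
>
>  double[] values = new double[12];   // fill with your 12 raw measurements
>  Vector v = new RandomAccessSparseVector(100);
>  for (int i = 0; i < 12; i++) {
>    // one encoder per variable: each name yields a different hashing pattern
>    ContinuousValueEncoder enc = new ContinuousValueEncoder("feature-" + i);
>    enc.addToVector(Double.toString(values[i]), v);
>  }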
>
>In your case, you could have used a goofy trick with a word encoder where
>the "word" is the variable name and the value of the variable is passed
>as the weight of the word.
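>
>Roughly like this, reusing values and v from above (same caveat:
>StaticWordValueEncoder is the word encoder I mean, but treat the calls
>as approximate):
>
>  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>
>  StaticWordValueEncoder enc = new StaticWordValueEncoder("features");
>  for (int i = 0; i < 12; i++) {
>    // the "word" is the variable name; the value rides along as the weight
>    enc.addToVector("feature-" + i, values[i], v);
>  }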
>
>But all of this hashing is really just extra work for you. It is easier
>to just pack your data into a dense vector.
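>
>That is, skip the encoders entirely:
>
>  import org.apache.mahout.math.DenseVector;
>
>  Vector v = new DenseVector(12);
>  for (int i = 0; i < 12; i++) {
>    v.setQuick(i, values[i]);   // one slot per variable, no hashing
>  }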
>
>
>> Finally, I thought online logistic regression meant that it is an online
>> algorithm, so it's fine to train only once. Does that mean I should invoke
>> the train method over and over again with the same training sample until
>> the next one arrives, or how else should I make the model converge (or at
>> least try to with the few samples)?
>>
>
>What online really implies is that training data is measured in terms of
>the number of input records instead of in terms of passes through the
>data. To converge, you have to see enough data. If that means you need
>to pass through the data several times to fool the learner ... well, it
>means you have to pass through the data several times.
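>
>In code, the multiple passes are just an outer loop. A sketch (the 2
>categories, 12 features, L1 prior, 100 passes, and the labels/vectors
>arrays are all placeholders for your own setup):
>
>  import org.apache.mahout.classifier.sgd.L1;
>  import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>
>  OnlineLogisticRegression olr = new OnlineLogisticRegression(2, 12, new L1());
>  for (int pass = 0; pass < 100; pass++) {
>    for (int i = 0; i < labels.length; i++) {
>      olr.train(labels[i], vectors[i]);   // same records, seen many times
>    }
>  }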
>
>Some online learners are exact in that they always have the exact result
>at hand for all the data they have seen. Welford's algorithm for
>computing sample mean and variance is like that. Others approximate an
>answer. Most systems which are estimating some property of a
>distribution are necessarily approximate. In fact, even Welford's method
>for means is really only approximating the mean of the distribution
>based on what it has seen so far. It happens that it gives you the best
>possible estimate so far, but that is just because computing a mean is
>simple enough. With regularized logistic regression, the estimation is
>trickier and you can only say that the algorithm will converge to the
>correct result eventually rather than say that the answer is always as
>good as it can be.
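>
>For reference, Welford's update is tiny (the samples array here is just
>made-up example data):
>
>  // one pass, constant memory; exact for the data seen so far
>  double[] samples = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
>  double n = 0, mean = 0, m2 = 0;
>  for (double x : samples) {
>    n++;
>    double delta = x - mean;
>    mean += delta / n;
>    m2 += delta * (x - mean);
>  }
>  double variance = n > 1 ? m2 / (n - 1) : 0;   // sample variance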
>
>Another way to say it is that the key property of online learning is
>that the learning takes a fixed amount of time and no additional memory
>for each input example.
>
>
>> What would you suggest to use for incremental training instead of OLR?
>> Is Mahout perhaps the wrong library?
>>
>
>Well, for thousands of examples, anything at all will work quite well,
>even R. Just keep all the data around and fit a model to it whenever
>requested.
>
>Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
>learner. A quick experiment indicates that it will handle 200K samples
>of the sort you are looking at in about a second, with multiple levels
>of lambda thrown into the bargain. Versions are available in R, Matlab
>and Fortran (at least).
>
>http://www-stat.stanford.edu/~tibs/glmnet-matlab/
>
>This kind of in-memory, single-machine problem is just not what Mahout
>is intended to solve.
