You are correct that it should work with smaller data as well, but the
tradeoffs are going to be very different.
In particular, some algorithms are completely infeasible at large scale,
but are very effective at small scale. Some, like those used in glmnet,
inherently require multiple passes through the data.
The Mahout committers have generally elected to spend time on larger-scale
problems, especially where really good small-scale solutions already exist.
That could change if somebody wanted to come in and support some set of
algorithms (hint, hint).
On Fri, Nov 8, 2013 at 3:15 AM, Andreas Bauer <buki@gmx.net> wrote:
> Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale
> machine learning, but I guess it shouldn't have problems with such small
> data either.
>
>
>
> Ted Dunning <ted.dunning@gmail.com> wrote:
> >On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <buki@gmx.net> wrote:
> >
> >> Hi,
> >>
> >> Thanks for your comments.
> >>
> >> I modified the examples from the Mahout in Action book, which is why I
> >> used the hashed approach and 100 features. I'll adjust the number.
> >>
> >
> >Makes sense. But the book was doing sparse features.
> >
> >
> >
> >> You say that I'm using the same CVE for all features, so you mean I
> >> should create 12 separate CVEs for adding features to the vector, like
> >> this?
> >>
> >
> >Yes. Otherwise you don't get different hashes. With a CVE, the hashing
> >pattern is generated from the name of the variable. For a word encoder,
> >the hashing pattern is generated from the name of the variable (specified
> >at construction of the encoder) and the word itself (specified at encode
> >time). Text is just repeated words, except that the weights aren't
> >necessarily linear in the number of times a word appears.
> >
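To make the point about variable names concrete, here is a minimal sketch of name-based feature hashing. It is not Mahout's encoder code: the functions `slot` and `encode`, the use of MD5, and the single hash location per name are all assumptions chosen for illustration, and the real encoders may differ in details such as the hash function and the number of probes.

```python
import hashlib

def slot(name, size):
    """Map a variable name to a vector index with a stable hash (illustrative)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size

def encode(values, names, size=100):
    """Add each value into the slot chosen by its variable name."""
    vec = [0.0] * size
    for name, value in zip(names, values):
        vec[slot(name, size)] += value
    return vec

values = [float(i + 1) for i in range(12)]

# One shared name: every value is added into the same slot.
shared = encode(values, ["feature"] * 12)

# One name per variable: values are spread over different slots
# (collisions are still possible, just no longer guaranteed).
distinct = encode(values, ["feature%d" % i for i in range(12)])

shared_nonzero = [i for i, v in enumerate(shared) if v != 0.0]
distinct_nonzero = [i for i, v in enumerate(distinct) if v != 0.0]
print(len(shared_nonzero), len(distinct_nonzero))
```

With a shared name, all 12 values pile up in one location and the learner only ever sees their sum, which is why one encoder per variable is needed.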
> >In your case, you could have used a goofy trick with a word encoder where
> >the "word" is the variable name and the value of the variable is passed as
> >the weight of the word.
> >
> >But all of this hashing is really just extra work for you. It is easier
> >to just pack your data into a dense vector.
> >
> >
> >> Finally, I thought online logistic regression meant that it is an online
> >> algorithm, so it's fine to train only once. Does that mean I should
> >> invoke the train method over and over again with the same training sample
> >> until the next one arrives? Or how should I make the model converge (or at
> >> least try to, with the few samples)?
> >>
> >
> >What online really implies is that training data is measured in terms of
> >number of input records instead of in terms of passes through the data.
> >To converge, you have to see enough data. If that means you need to pass
> >through the data several times to fool the learner ... well, it means you
> >have to pass through the data several times.
> >
> >Some online learners are exact in that they always have the exact result
> >at hand for all the data they have seen. Welford's algorithm for computing
> >sample mean and variance is like that. Others approximate an answer. Most
> >systems which are estimating some property of a distribution are
> >necessarily approximate. In fact, even Welford's method for means is
> >really only approximating the mean of the distribution based on what it
> >has seen so far. It happens that it gives you the best possible estimate
> >so far, but that is just because computing a mean is simple enough. With
> >regularized logistic regression, the estimation is trickier and you can
> >only say that the algorithm will converge to the correct result eventually
> >rather than say that the answer is always as good as it can be.
> >
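For reference, Welford's update can be sketched in a few lines. This is an illustrative version only; the class name `RunningStats` and the sample data are made up for the example.

```python
class RunningStats:
    """Welford's online algorithm for sample mean and variance."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(x)

print(stats.mean, stats.variance())
```

After every call to `add`, the mean and variance are exact for the values seen so far, and the state is just three numbers no matter how many values have been processed.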
> >Another way to say it is that the key property of online learning is that
> >the learning takes a fixed amount of time and no additional memory for
> >each input example.
> >
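The difference between one pass and several passes can be seen with a toy online learner. The sketch below is plain stochastic gradient descent for L2-regularized logistic regression, not Mahout's OnlineLogisticRegression; the class name `OnlineLogistic`, the learning rate, the regularization constant, and the data are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class OnlineLogistic:
    """SGD logistic regression: fixed work and fixed memory per example."""

    def __init__(self, n_features, rate=0.1, l2=1e-4):
        self.w = [0.0] * n_features
        self.rate = rate
        self.l2 = l2

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def train(self, x, y):
        """One gradient step on a single example; nothing is stored."""
        err = y - self.predict(x)
        for i, xi in enumerate(x):
            self.w[i] += self.rate * (err * xi - self.l2 * self.w[i])

def log_loss(model, data):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in data:
        p = min(max(model.predict(x), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

# Tiny data set: a bias term plus one feature, label 1 when the feature > 0.
data = [([1.0, v], 1 if v > 0 else 0)
        for v in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)]

one_pass = OnlineLogistic(2)
for x, y in data:
    one_pass.train(x, y)

many_passes = OnlineLogistic(2)
for _ in range(50):
    for x, y in data:
        many_passes.train(x, y)

loss_one = log_loss(one_pass, data)
loss_many = log_loss(many_passes, data)
print(loss_one, loss_many)
```

Each call to `train` costs the same regardless of how many examples came before, but with only eight records a single pass leaves the weights far from converged, so the same records are simply fed through again.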
> >
> >> What would you suggest to use for incremental training instead of OLR?
> >> Is Mahout perhaps the wrong library?
> >>
> >
> >Well, for thousands of examples, anything at all will work quite well,
> >even R. Just keep all the data around and fit the data whenever requested.
> >
> >Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
> >learner. A quick experiment indicates that it will handle 200K samples of
> >the sort you are looking at in about a second, with multiple levels of
> >lambda thrown into the bargain. Versions are available in R, Matlab and
> >Fortran (at least).
> >
> >http://wwwstat.stanford.edu/~tibs/glmnetmatlab/
> >
> >This kind of in-memory, single-machine problem is just not what Mahout is
> >intended to solve.
>
>
