mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Library for scalable logistic regression
Date Thu, 06 May 2010 17:15:43 GMT
Glad to hear that you have made good use of Mahout so far.

My recommendations right now for scalable classifiers are generally in the
SGD area, the canonical example of which is Vowpal Wabbit.  Another
benchmark implementation is glmnet, which does lasso and elastic-net
regularization.  Vowpal Wabbit will definitely scale to the size you are
talking about, but truly shines on very large feature spaces.  Glmnet is
very good and very efficient, but it currently assumes the data fits
in core, which limits its applicability to your problem.

With only 100 features, my guess is that you can train a main-effects model
on a relatively small subset of your data, particularly if you have an
asymmetric target.  You can also use the standard "train-on-errors"
technique to augment your original sampled dataset, so that you still have a
small training set that captures what you need from your larger dataset.
This might be particularly helpful if you want to train on interactions.

The general procedure there would be to

a) train a main-effects model on about 1M balanced sample
b) scan your full dataset and retain about 1M samples that have the worst
errors
c) build a fancy new model on the 2M samples
d) rinse, repeat while AUC improves
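The loop above can be sketched roughly as follows.  This is a minimal
illustration using scikit-learn on small synthetic data (sizes scaled way
down from 300M; all names and sizes are illustrative), not Mahout's or
Vowpal Wabbit's API:

```python
# Sketch of the "train-on-errors" loop: train on a small balanced sample,
# scan the full data for the worst errors, retrain on the union.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in for the full dataset: 50k samples, 100 dense features.
X_full = rng.normal(size=(50_000, 100))
w_true = rng.normal(size=100)
y_full = (X_full @ w_true + rng.normal(scale=2.0, size=50_000) > 0).astype(int)

# a) train a main-effects model on a small balanced sample
pos = rng.choice(np.flatnonzero(y_full == 1), 1_000, replace=False)
neg = rng.choice(np.flatnonzero(y_full == 0), 1_000, replace=False)
idx = np.concatenate([pos, neg])
model = LogisticRegression(max_iter=1000).fit(X_full[idx], y_full[idx])

# b) scan the full dataset and retain the samples with the worst errors
p = model.predict_proba(X_full)[:, 1]
worst = np.argsort(np.abs(y_full - p))[-2_000:]

# c) build a new model on the original sample plus the hard cases
idx2 = np.unique(np.concatenate([idx, worst]))
model2 = LogisticRegression(max_iter=1000).fit(X_full[idx2], y_full[idx2])

# d) rinse, repeat while AUC improves (one round shown here)
auc = roc_auc_score(y_full, model2.predict_proba(X_full)[:, 1])
print(f"AUC after one round: {auc:.3f}")
```

In practice steps b) and c) would be MapReduce passes over the full data on
the cluster, with only the retained samples pulled down for training.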


On Thu, May 6, 2010 at 9:15 AM, Danny Leshem <dleshem@gmail.com> wrote:

> Hi!
>
> I'm currently working on a rather large-scale dataset (~300M samples
> represented as dense vectors of cardinality ~100).
> The data lives in an EC2 Hadoop cluster and is pre-processed using MR jobs,
> including heavy usage of Mahout (Lanczos decomposition, clustering, etc).
>
> I'm now looking for ways to learn a logistic regression model based on the
> data.
> So far I postponed this part of the project, hoping for MAHOUT-228
> <https://issues.apache.org/jira/browse/MAHOUT-228> to be ready... but
> unfortunately I can't afford to wait any more :)
>
> Looking around, I've found Google's sofia-ml
> <http://code.google.com/p/sofia-ml/> and some UC Berkeley Hadoop-based
> implementation
> <http://berkeley-mltea.pbworks.com/Hadoop-for-Machine-Learning-Guide>.
> Does anyone have experience with these, or know of / have used a good
> library for logistic regression at this scale?
>
> Thanks,
> Danny
>
