mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?
Date Thu, 02 Jun 2011 04:15:04 GMT
I prefer to make my final held-out set look as much as possible like the
data the model will see in production.  So if you plan to retrain every
week, I would train on all available data up to time t and then test on
data from t to t + 1 week.

ALR's internal hold-out set is useful, but things change over time, and
having a held-out sample from the future (relative to the model) is much
more realistic.
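
As a minimal sketch of the time-based split described above (plain Python,
not Mahout code; the record layout, the sample values, and the
`split_by_time` helper are assumptions made purely for illustration):

```python
from datetime import datetime, timedelta


def split_by_time(records, cutoff, horizon=timedelta(weeks=1)):
    """Train on everything before `cutoff`; test on [cutoff, cutoff + horizon).

    `records` is assumed to be a list of (timestamp, features, label) tuples.
    """
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if cutoff <= r[0] < cutoff + horizon]
    return train, test


# Hypothetical records: (timestamp, features, label)
records = [
    (datetime(2011, 5, 20), [0.1, 1.0], 0),
    (datetime(2011, 5, 24), [0.7, 0.2], 1),
    (datetime(2011, 5, 26), [0.4, 0.9], 0),
    (datetime(2011, 5, 30), [0.9, 0.1], 1),
    (datetime(2011, 6, 3), [0.3, 0.3], 0),  # beyond the one-week test window
]

cutoff = datetime(2011, 5, 25)  # plays the role of "time t"
train, test = split_by_time(records, cutoff)
print(len(train), "training records;", len(test), "test records")
```

The training portion could still be fed to a learner that keeps its own
internal hold-out; the point of the split is only that no record in the test
window is ever seen during training.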

On Wed, Jun 1, 2011 at 8:03 PM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:

> On our site we will use logistic regression in batch mode. Customers who
> entered in one time frame (such as 2010/1/1 ~ 2010/12/31) will be used to
> train the model, and customers who entered in another time frame (such as
> 2011/1/1 ~ 2011/5/31) will be used to validate it. The model will then be
> used to make predictions for users who enter after 2011/6/1. Does this make
> sense, or should we feed all the data from 2010/1/1 to 2011/5/31 to ALR and
> let it do the hold-out internally?
>
>
>
> On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > You don't *have* to have a separate validation set, but it isn't a bad
> > idea.
> >
> > In particular, with large-scale classifiers, production data almost always
> > comes from the future with respect to the training data.  ALR can't hold
> > out that way because it does on-line training only.  Thus, I would
> > recommend that you still have some kind of evaluation hold-out set
> > segregated by time.
> >
> > Another very serious issue can arise if you have near-duplicates in your
> > data set.  That often happens in news-wire text, for example.  In that
> > case, you would have significant over-fitting with ALR and you wouldn't
> > have a clue without a real time-segregated hold-out set.
> >
> > On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Because ALR splits the training data internally and automatically, I
> >> think we don't have to make a separate validation data set.
> >>
> >> Regards,
> >>
> >> Xiaobo Gu
> >>
> >
>
