mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?
Date Wed, 01 Jun 2011 14:18:41 GMT
You don't *have* to have a separate validation set, but it isn't a bad idea.

In particular, with large-scale classifiers, production data almost always
comes from the future with respect to the training data.  The ADR can't hold
out that way because it does on-line training only.  Thus, I would
recommend that you still have some kind of evaluation hold-out set
segregated by time.
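
As a rough illustration of a time-segregated hold-out, here is a minimal
Python sketch.  It is not Mahout code; the record layout (timestamp, text,
label), the function name time_split, and the cutoff date are all assumptions
made up for the example.

```python
from datetime import datetime

def time_split(records, cutoff):
    """Split records into (train, holdout) by timestamp.

    Each record is assumed to be a (timestamp, text, label) tuple.
    Everything strictly before the cutoff is used for training; everything
    at or after it is held out, so the hold-out set lies in the "future"
    relative to the training data.
    """
    train = [r for r in records if r[0] < cutoff]
    holdout = [r for r in records if r[0] >= cutoff]
    return train, holdout

# Hypothetical toy data, for illustration only.
records = [
    (datetime(2011, 5, 1), "doc a", 0),
    (datetime(2011, 5, 15), "doc b", 1),
    (datetime(2011, 5, 28), "doc c", 0),
    (datetime(2011, 6, 1), "doc d", 1),
]
train, holdout = time_split(records, datetime(2011, 5, 20))
```

You would train only on the first part and score the finished model on the
second part, rather than letting a random split mix the two periods.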

Another very serious issue can happen if you have near duplicates in your
data set.  That often happens in news-wire text, for example.  In that case,
you would have significant over-fitting with ADR and you wouldn't have a
clue without a real time-segregated hold-out set.
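
One way to get a feel for how bad the near-duplicate problem is: count how
many hold-out documents closely match something in the training set.  The
sketch below uses Jaccard similarity over word 3-grams.  The technique, the
function names, and the 0.8 threshold are my assumptions for illustration,
not anything built into Mahout, and the all-pairs comparison is only
practical for small samples.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.lower().split()
    # Texts shorter than k words yield a single, shorter shingle.
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leaked(train_texts, holdout_texts, threshold=0.8):
    """Return hold-out texts that nearly duplicate some training text.

    The threshold is an arbitrary illustrative value; tune it on your data.
    """
    train_sets = [shingles(t) for t in train_texts]
    return [h for h in holdout_texts
            if any(jaccard(shingles(h), s) >= threshold for s in train_sets)]

# Hypothetical news-wire style example.
train_texts = ["stocks rose sharply on wall street today after the fed announcement"]
holdout_texts = [
    "stocks rose sharply on wall street today after the fed announcement update",
    "local team wins the championship game in overtime",
]
dupes = leaked(train_texts, holdout_texts)
```

If a randomly chosen hold-out set has many such matches and a time-segregated
one has few, the random hold-out score is probably too optimistic.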

On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:

> Hi,
>
> Because ADR splits the training data internally and automatically, I
> think we don't have to make a separate validation data set.
>
> Regards,
>
> Xiaobo Gu
>
