mahout-user mailing list archives

From Lance Norskog <>
Subject Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?
Date Thu, 02 Jun 2011 04:02:00 GMT
Ah! You have a domain problem here, in that your input set may not be
homogeneous over time: data from different time periods might be
distributed differently. You may need to break your time series into
bands and train and test with overlapping bands. That is, train on
January through April, then do two tests: verify with a held-out test
set from that same period, and then test with March through July. This
will show you whether the data changes over time.
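As a rough illustration, something like the following (plain Python, not Mahout code; the records and month boundaries are hypothetical, chosen only to mirror the example above):

```python
from datetime import date

# Illustrative sketch only: records are hypothetical
# (entry_date, features, label) tuples, not Mahout vectors.

def in_band(record, start, end):
    """Keep records whose entry date falls inside [start, end)."""
    return start <= record[0] < end

def split_bands(records):
    # Train on January through April 2010, then evaluate twice: once on
    # data held out from that same period, once on an overlapping later band.
    train = [r for r in records if in_band(r, date(2010, 1, 1), date(2010, 5, 1))]
    test_same = train[::5]                 # hold out every 5th training record
    train = [r for r in train if r not in test_same]
    test_later = [r for r in records
                  if in_band(r, date(2010, 3, 1), date(2010, 8, 1))]
    return train, test_same, test_later
```

If the score on test_later is clearly worse than on test_same, that is evidence the data is drifting over time.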

The AverageAbsoluteDifferenceRecommenderEvaluator does a train and test
across one dataset. It randomly picks, say, 80% for training and then
randomly picks perhaps 30-40% for testing. This is why I suggest the
overlapping bands above: if the data changes over time, the two tests
will differ.
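A minimal sketch of that kind of random split (plain Python, not the actual Mahout evaluator API; the 80% / 30% figures just mirror the numbers above):

```python
import random

def random_split(records, train_frac=0.8, test_frac=0.3, seed=42):
    """Shuffle once, take train_frac for training, then test_frac of the rest."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    train = shuffled[:n_train]
    rest = shuffled[n_train:]
    test = rest[:int(len(rest) * test_frac)]
    return train, test
```

Because both picks are random rather than time-ordered, a model that over-fits to one era can still score well on a split like this, which is exactly why the overlapping-band comparison is worth doing as well.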


On Wed, Jun 1, 2011 at 8:03 PM, Xiaobo Gu <> wrote:
> On our site we will use Logistic Regression in a batch manner:
> customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31)
> will be used to train the model, and customers who entered in another
> time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate the
> model; then the model will be used to predict users who enter after
> 2011/6/1. Does this make sense, or should we feed all data from
> 2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
> On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <> wrote:
>> You don't *have* to have a separate validation set, but it isn't a bad idea.
>> In particular, with large scale classifiers production data almost always
>> comes from the future with respect to the training data.  The ADR can't hold
>> out that way because it does on-line training only.  Thus, I would
>> recommend that you still have some kind of evaluation hold-out set
>> segregated by time.
>> Another very serious issue can happen if you have near duplicates in your
>> data set.  That often happens in news-wire text, for example.  In that case,
>> you would have significant over-fitting with ADR and you wouldn't have a
>> clue without a real time-segregated hold-out set.
>> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <> wrote:
>>> Hi,
>>> Because ADR splits the training data internally and automatically, I
>>> think we don't have to make a separate validation data set.
>>> Regards,
>>> Xiaobo Gu

Lance Norskog
