mahout-user mailing list archives

From Mani Kumar <>
Subject Re: Incremental training of Classifier
Date Mon, 28 Dec 2009 21:33:52 GMT
Thanks for the quick response.

@Robin, I absolutely agree with your suggestion to use the 600 docs for
monitoring performance.

Let's talk about bigger numbers, e.g. I have more than 1 million docs and I
get 10k new docs every day, out of which 6k are already classified.

Monitoring performance is good, but it can be done weekly instead of daily
to reduce cost.

I actually wanted to avoid retraining as much as possible because it
comes at a huge cost for a large dataset.
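
A rough sketch of that kind of check, assuming the classifier can be called as
a plain function from document text to label. The names `accuracy`,
`should_retrain`, `baseline_accuracy` and `tolerance` are made up for
illustration and are not part of Mahout; the 0.02 tolerance is an arbitrary
placeholder.

```python
from typing import Callable, Iterable, Tuple

LabeledDoc = Tuple[str, str]  # (document text, known category)


def accuracy(classify: Callable[[str], str], labeled_docs: Iterable[LabeledDoc]) -> float:
    """Fraction of already-classified docs that the current model labels correctly."""
    docs = list(labeled_docs)
    if not docs:
        raise ValueError("need at least one labeled document")
    correct = sum(1 for text, label in docs if classify(text) == label)
    return correct / len(docs)


def should_retrain(
    classify: Callable[[str], str],
    new_labeled_docs: Iterable[LabeledDoc],
    baseline_accuracy: float,
    tolerance: float = 0.02,  # hypothetical threshold, tune for your data
) -> bool:
    """Retrain only when accuracy on fresh docs falls clearly below the baseline."""
    return accuracy(classify, new_labeled_docs) < baseline_accuracy - tolerance
```

Run weekly on the newly classified docs, this would trigger the expensive
retraining only when the measured accuracy actually drops.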

A better solution could be to use the 50k most recent docs from every
category (order by created_at desc), to reduce the amount of data and stay tuned with latest
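
A minimal sketch of that selection, assuming each doc is a dict with
`category` and `created_at` fields. The field names and the function
`latest_per_category` are hypothetical; in practice this would more likely be
a query against wherever the docs are stored.

```python
import heapq
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Doc = Dict[str, Any]


def latest_per_category(docs: Iterable[Doc], per_category: int = 50_000) -> List[Doc]:
    """Keep only the newest `per_category` docs of each category."""
    by_category: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_category[doc["category"]].append(doc)

    selected: List[Doc] = []
    for category in sorted(by_category):
        # nlargest returns the docs ordered by created_at, newest first.
        newest = heapq.nlargest(
            per_category, by_category[category], key=lambda d: d["created_at"]
        )
        selected.extend(newest)
    return selected
```

The training set size is then bounded by 50k times the number of categories,
no matter how large the full archive grows.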

Thanks a lot guys.

-Mani Kumar

On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <> wrote:

> On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <> wrote:
> > Long answer, You can use your 600 docs to test the classifier and see your
> > accuracy. Then retrain with the entire documents and then test a test data
> > set. So daily you can choose to include or exclude the 600 documents that
> > come and ensure that you keep your classifier at the top performance. After
> > some amount of documents, you dont get much benefit of retraining. Further
> > training would only add over fitting errors.
> >
> The suggestion that the 600 new documents be used to monitor performance is
> an excellent one.
> It should be pretty easy to add the "train on incremental data" option to
> K-means.
> Also, the k-means algorithm definitely will reach a point of diminishing
> returns, but it should be very resistant to over training.
