mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Incremental training of Classifier
Date Mon, 28 Dec 2009 21:44:07 GMT
With a 50K set, you may or may not lose out on some features; it depends
entirely on the data. If you don't mind answering: how many categories do
you have?

I agree that re-training on 1 million docs is cumbersome. But if I remember
correctly, I trained (CBayes) on a 3GB subset of Wikipedia on 6 Pentium-4 HT
systems in 20 minutes. I don't know how big your data or your cluster is,
but a daily 1-hour map/reduce job is not that expensive (maybe I am blind
and have no sense of what is big after working at Google). I say, try and
estimate it yourself.
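
That estimate can be sketched with simple back-of-envelope arithmetic, using the 3 GB / 6 machines / 20 minutes run above as a throughput baseline. The linear-scaling assumption (and the example corpus sizes) are mine, not measured figures; real jobs also pay startup and shuffle overhead.

```python
# Back-of-envelope estimate of retraining cost. The baseline numbers come
# from the run described above (3 GB on 6 machines in 20 minutes); the
# linear-scaling model is an assumption, not a benchmark.
baseline_gb, baseline_machines, baseline_minutes = 3.0, 6, 20.0

# GB processed per machine-minute under the baseline run
throughput = baseline_gb / (baseline_machines * baseline_minutes)

def estimated_minutes(corpus_gb, num_machines):
    """Naive linear scaling; ignores job startup and shuffle overhead."""
    return corpus_gb / (throughput * num_machines)

print(estimated_minutes(30.0, 12))  # hypothetical 30 GB corpus on 12 machines
```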

On the other hand, you could also try a two-fold approach: a sturdy
classifier trained on the full 1 million docs, plus a classifier trained on
the recent 50K docs, and do some form of voting between them.
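
One minimal sketch of that voting idea: take each model's per-category scores and pick the category with the highest weighted sum. The weights and the score-combination rule here are assumptions to illustrate the shape of it; you would tune them on held-out data.

```python
# Combine scores from the full-history model and the recent-docs model.
# The 0.7/0.3 default weights are illustrative, not recommended values.
def combine(scores_full, scores_recent, w_full=0.7, w_recent=0.3):
    """scores_*: dict mapping category -> classifier score (higher = better).
    Returns the category with the highest weighted sum of scores."""
    categories = set(scores_full) | set(scores_recent)
    def weighted(c):
        return w_full * scores_full.get(c, 0.0) + w_recent * scores_recent.get(c, 0.0)
    return max(categories, key=weighted)
```

Raising `w_recent` biases the ensemble toward recent trends; raising `w_full` biases it toward the stable long-term model.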

I am sure you will not be able to load the 1M-doc model into memory; you
might need to use HBase there. You can instead keep the 50K model in memory
for fast classification, then run a daily batch classification job to
re-classify your dataset using the 1M-doc model.
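
Part of why incremental training is feasible here at all is that Bayes-style models (like Mahout's Bayes/CBayes) boil down to per-category counts, so newly labeled docs can be folded in by adding counts rather than retraining from scratch. The toy structure below illustrates that idea; it is not Mahout's actual model format.

```python
from collections import defaultdict

# Toy count-based model to show why Naive Bayes-style classifiers
# can be updated incrementally: training state is just counts,
# and counts are additive. Illustrative only -- not Mahout's format.
class CountModel:
    def __init__(self):
        # category -> word -> count
        self.word_counts = defaultdict(lambda: defaultdict(int))
        # category -> number of training docs seen
        self.doc_counts = defaultdict(int)

    def add_document(self, category, words):
        """Fold one newly labeled document into the model in O(len(words))."""
        self.doc_counts[category] += 1
        for w in words:
            self.word_counts[category][w] += 1
```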

Robin



On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <manikumarchauhan@gmail.com> wrote:

> Thanks for the quick response.
>
> @Robin absolutely agree on your suggestion regarding using 600 docs for
> monitoring performance.
>
> Let's talk about bigger numbers: e.g. I have more than 1 million docs and
> get 10K new docs every day, out of which 6K are already classified.
>
> Monitoring performance is good, but it can be done weekly instead of daily
> just to reduce cost.
>
> I actually wanted to avoid retraining as much as possible because it comes
> with a huge cost for a large dataset.
>
> A better solution could be to use the 50K most recent docs from every
> category (ordered by created_at desc), to reduce the amount of data and
> stay tuned to the latest trends.
>
> Thanks a lot guys.
>
> -Mani Kumar
>
> On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <robin.anil@gmail.com>
> wrote:
> >
> > > Long answer: you can use your 600 docs to test the classifier and see
> > > your accuracy, then retrain with the entire document set and test on a
> > > held-out test data set. So each day you can choose to include or
> > > exclude the 600 documents that come in, and ensure you keep your
> > > classifier at top performance. After some number of documents, you
> > > don't get much benefit from retraining; further training would only
> > > add overfitting errors.
> > >
> >
> > The suggestion that the 600 new documents be used to monitor performance
> > is an excellent one.
> >
> > It should be pretty easy to add the "train on incremental data" option
> > to k-means.
> >
> > Also, the k-means algorithm definitely will reach a point of diminishing
> > returns, but it should be very resistant to over-training.
> >
>
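
The monitoring step discussed above (scoring the classifier on each day's ~600 newly labeled docs before deciding whether to retrain) can be sketched as a simple held-out accuracy check. `classify` here is a stand-in for whatever deployed classifier you are evaluating.

```python
# Held-out evaluation sketch for the daily monitoring idea: treat the
# newly labeled docs as a test set and measure accuracy before folding
# them into training. classify() is a hypothetical stand-in.
def accuracy(classify, labeled_docs):
    """labeled_docs: list of (doc, true_label) pairs."""
    if not labeled_docs:
        return 0.0
    correct = sum(1 for doc, label in labeled_docs if classify(doc) == label)
    return correct / len(labeled_docs)
```

If the accuracy on the new docs drifts downward over time, that is the signal to retrain; if it holds steady, retraining can be deferred, as suggested above.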
