mahout-user mailing list archives

From Mani Kumar <manikumarchau...@gmail.com>
Subject Re: Incremental training of Classifier
Date Mon, 28 Dec 2009 22:24:22 GMT
Comments inline.

-Mani Kumar

On Tue, Dec 29, 2009 at 3:14 AM, Robin Anil <robin.anil@gmail.com> wrote:

> with a 50K set, you may or may not lose out on some features. It depends
> entirely on the data. If you don't mind answering: what is the number of
> categories that you have?
>

  ~50 categories

>
> I agree that re-training on 1 million docs is cumbersome. But if I remember
> correctly, I trained (CBayes) on a 3GB subset of Wikipedia on 6 Pentium-4 HT
> systems in 20 mins.


-- that's fast.


> I don't know how big your data or your cluster is. But a daily 1-hour
> map/reduce job is not that expensive (maybe I am blind and have no sense
> of what is big after working at Google). I say, try and estimate it
> yourself.


-- a daily 1-hour job is not an issue, but daily 6-8 hours would be.


>
> On the other hand, you could also try a two-fold approach: a sturdy
> classifier trained on 1 million docs and a classifier trained on the most
> recent 50K docs, with some form of voting between them.
>
> I am sure you will not be able to load the 1mil model into memory; you
> might need to use HBase there. Instead you can use the 50K model in memory
> for fast classification, then run a daily batch job to re-classify your
> dataset based on the 1mil model.
>

-- yes, I'll have to use HBase. Thanks!
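Robin's two-model voting idea above could be sketched roughly as follows. This is an illustrative Java sketch only, not Mahout code: the class name, the per-category score maps, and the linear weighting scheme are all assumptions for the sake of the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of voting between a large "sturdy" model and a small
// "recent" model: combine per-category scores with a tunable weight and pick
// the best category. Score maps are assumed to use the same category keys.
public class VotingClassifier {

    public static String vote(Map<String, Double> bigModelScores,
                              Map<String, Double> recentModelScores,
                              double recentWeight) {
        Map<String, Double> combined = new HashMap<>();
        // Weighted contribution from the 1mil-doc model.
        for (Map.Entry<String, Double> e : bigModelScores.entrySet()) {
            combined.put(e.getKey(), (1.0 - recentWeight) * e.getValue());
        }
        // Weighted contribution from the recent 50K-doc model.
        for (Map.Entry<String, Double> e : recentModelScores.entrySet()) {
            combined.merge(e.getKey(), recentWeight * e.getValue(), Double::sum);
        }
        // Pick the category with the highest combined score.
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : combined.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With recentWeight near 0 this degenerates to the big model alone; near 1 it follows the recent model, so the weight controls how strongly recent trends override the sturdy classifier.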


>
> Robin
>
>
>
> On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <manikumarchauhan@gmail.com
> >wrote:
>
> > Thanks for the quick response.
> >
> > @Robin: I absolutely agree with your suggestion regarding using the 600
> > docs for monitoring performance.
> >
> > Let's talk about bigger numbers: e.g. I have more than 1 million docs and
> > I get 10K new docs every day, out of which 6K are already classified.
> >
> > Monitoring performance is good, but it can be done weekly instead of
> > daily just to reduce cost.
> >
> > I actually wanted to avoid retraining as much as possible because it
> > comes with a huge cost for a large dataset.
> >
> > A better solution could be to use the 50K most recent docs from every
> > category (ordered by created_at desc), to reduce the amount of data and
> > stay tuned to the latest trends.
> >
> > Thanks a lot guys.
> >
> > -Mani Kumar
> >
> > On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <robin.anil@gmail.com>
> > wrote:
> > >
> > > > Long answer: you can use your 600 docs to test the classifier and
> > > > see your accuracy, then retrain with the entire document set and
> > > > test against a test data set. So each day you can choose to include
> > > > or exclude the 600 documents that come in, and ensure that you keep
> > > > your classifier at top performance. After some amount of documents,
> > > > you don't get much benefit from retraining; further training would
> > > > only add overfitting errors.
> > > >
> > >
> > > The suggestion that the 600 new documents be used to monitor
> > > performance is an excellent one.
> > >
> > > It should be pretty easy to add the "train on incremental data" option
> > > to K-means.
> > >
> > > Also, the k-means algorithm definitely will reach a point of
> > > diminishing returns, but it should be very resistant to overtraining.
> > >
> >
>
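Ted's and Robin's point about using the incoming labeled docs as a held-out monitoring set could look something like the sketch below. This is an assumption-laden illustration, not Mahout code: the record type, the `classify` function, and the minimum-gain retraining heuristic are all invented for the example.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of daily performance monitoring: score the classifier
// on the day's already-labeled docs, and only retrain while retraining still
// yields a meaningful accuracy gain (the diminishing-returns point above).
public class AccuracyMonitor {

    public record LabeledDoc(String text, String label) {}

    // Fraction of held-out docs the classifier labels correctly.
    public static double accuracy(List<LabeledDoc> heldOut,
                                  Function<String, String> classify) {
        if (heldOut.isEmpty()) {
            return 0.0;
        }
        long correct = heldOut.stream()
                .filter(d -> classify.apply(d.text()).equals(d.label()))
                .count();
        return (double) correct / heldOut.size();
    }

    // Retrain only while new data still moves the needle by at least minGain.
    public static boolean shouldRetrain(double oldAccuracy, double newAccuracy,
                                        double minGain) {
        return newAccuracy - oldAccuracy >= minGain;
    }
}
```

Run accuracy() against each day's labeled batch; once the gain from retraining falls below the threshold, drop back to the weekly schedule discussed above.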
