mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Incremental training of Classifier
Date Tue, 29 Dec 2009 09:40:23 GMT
On Tue, Dec 29, 2009 at 10:45 AM, Mani Kumar <manikumarchauhan@gmail.com> wrote:

> @Robin: thanks! By the way, what's the reasoning behind using CBayes for >2
> categories? Bayes works for spam/not-spam kinds of classification, so why
> not for >2 categories? It'd be great if you could give some pointers to read
> and understand.
>
The math behind it differs only slightly. CBayes is Bayes, but it tries to
classify objects as not belonging to a class instead of belonging to a class:
each class is modelled from the documents outside it. For more insight, read
the paper on Complementary Naive Bayes. Run a quick experiment on 20
newsgroups with CBayes and Bayes and you will see the difference.
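
To make the "not belonging to a class" idea concrete, here is a minimal
sketch in plain Python. It is not Mahout's implementation: the function names,
the toy documents, the add-one smoothing, and the choice to ignore class priors
are all assumptions made for illustration. Bayes scores a class by how well the
class's own word counts explain the document; CBayes scores a class by how
poorly the counts of all the other classes explain it.

```python
import math
from collections import Counter, defaultdict

def train_counts(docs):
    """docs: list of (label, tokens). Returns per-class word counts and the vocabulary."""
    counts = defaultdict(Counter)
    for label, tokens in docs:
        counts[label].update(tokens)
    vocab = {w for wc in counts.values() for w in wc}
    return counts, vocab

def bayes_scores(counts, vocab, tokens, alpha=1.0):
    """Log-likelihood of the tokens under each class (uniform priors assumed)."""
    scores = {}
    for label, wc in counts.items():
        total = sum(wc.values()) + alpha * len(vocab)
        scores[label] = sum(math.log((wc[w] + alpha) / total)
                            for w in tokens if w in vocab)
    return scores

def cbayes_scores(counts, vocab, tokens, alpha=1.0):
    """Negated log-likelihood of the tokens under each class's complement."""
    scores = {}
    for label in counts:
        comp = Counter()  # word counts pooled from every class except `label`
        for other, wc in counts.items():
            if other != label:
                comp.update(wc)
        total = sum(comp.values()) + alpha * len(vocab)
        # A poor fit to the complement is evidence for the class, hence the minus sign.
        scores[label] = -sum(math.log((comp[w] + alpha) / total)
                             for w in tokens if w in vocab)
    return scores

def classify(scores):
    return max(scores, key=scores.get)

# Hypothetical toy corpus, only to exercise the two scoring rules.
docs = [
    ("sports", "ball game team win".split()),
    ("sports", "team score game".split()),
    ("politics", "vote election party".split()),
    ("tech", "code software bug".split()),
]
counts, vocab = train_counts(docs)
query = "game team".split()
print(classify(bayes_scores(counts, vocab, query)))
print(classify(cbayes_scores(counts, vocab, query)))
```

In this sketch, with exactly two classes and no priors, the two rules pick the
same label, because each class's complement is simply the other class. The
rules can only diverge once there are more than two categories, where the
complement pools counts from several classes.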



> @Ted: I have only just started experimenting with Mahout and don't yet have
> a very clear picture of how it can work for us. I'll let you know the details
> as I get more experience with Mahout and a deeper understanding of our
> requirements.
>
> Thanks!
> Mani Kumar
>
> On Tue, Dec 29, 2009 at 6:14 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > mani,
> >
> > You are sounding more and more like the poster child for an on-line
> > classifier.
> >
> > The idea would be that you would give your classified docs to the system
> > first for testing, then again for incremental training.  You can use the
> > results of the test to adjust the learning rate for the incremental
> > learning.
> >
> > See the work I have started with MAHOUT-228 for the beginnings of this.
> > Let me know where it should go to help with your needs (i.e. what entry
> > points you would need).
> >
> > On Mon, Dec 28, 2009 at 1:33 PM, Mani Kumar <manikumarchauhan@gmail.com> wrote:
> >
> > > Let's talk about bigger numbers: I have more than 1 million docs and
> > > I get 10k new docs every day, of which 6k are already classified.
> > >
> > > Monitoring performance is good, but it can be done weekly instead of
> > > daily to reduce cost.
> > >
> > > I actually wanted to avoid retraining as much as possible because it
> > > comes at a huge cost for a large dataset.
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>
