mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Naive Bayes vs. SGD
Date Mon, 06 Dec 2010 15:43:28 GMT
Sort of kind of correct.

Naive Bayes does need to be trained on all of the data that you want it to learn from. If you want to retrain it, you have to give it a bunch of data again, but that doesn't have to include the old data. You do have to keep a bunch of data around for training.

But this isn't because of any non-parametric property of the Naive Bayes algorithm. It is because of the way that the program is written using map-reduce. An online version of the algorithm is quite feasible; we just don't have one.
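For illustration, here is a minimal sketch of what such an online Naive Bayes could look like (plain Python with hypothetical names, not Mahout code): training just increments per-class counts one document at a time, so no training data needs to be retained.

```python
from collections import defaultdict
import math

class OnlineNaiveBayes:
    """Hypothetical online multinomial Naive Bayes. Counts are updated
    incrementally, so each training document can be discarded after use."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing
        self.class_counts = defaultdict(int)  # documents per class
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)  # word tokens per class
        self.vocab = set()

    def train(self, label, words):
        # Incremental update: just bump counts; no old data is kept.
        self.class_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.class_totals[label] += 1
            self.vocab.add(w)

    def classify(self, words):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c in self.class_counts:
            # log P(c) + sum of smoothed log P(w | c)
            score = math.log(self.class_counts[c] / n_docs)
            for w in words:
                score += math.log(
                    (self.word_counts[c][w] + self.alpha)
                    / (self.class_totals[c] + self.alpha * v))
            if score > best_score:
                best, best_score = c, score
        return best

nb = OnlineNaiveBayes()
nb.train("spam", ["buy", "cheap", "pills"])
nb.train("ham", ["meeting", "at", "noon"])
print(nb.classify(["cheap", "pills"]))  # -> spam
```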

SGD, on the other hand, is precisely the opposite. It is only available in an online version.

Being online, SGD can stop early if your model gets to the point you want. It is also really fast, which is nice when you don't want to spend all the effort to spin up a Hadoop cluster and get a map-reduce program started. SGD can also be trained incrementally with new data without retaining the old data.
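As a rough illustration of why incremental training and early stopping come for free with SGD (a generic Python sketch, not Mahout's SGD implementation; all names here are hypothetical): each example updates the weights once and can then be thrown away, and training can stop as soon as a running estimate of the loss is good enough.

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1, target_loss=None):
    # Hypothetical online SGD trainer for binary logistic regression.
    # Each (x, y) example is seen once and then discarded; training can
    # stop early once the running loss drops below target_loss.
    w = [0.0] * dim
    running = None
    for x, y in stream:                        # y in {0, 1}
        z = sum(wi * xi for wi, xi in zip(w, x))
        z = max(min(z, 30.0), -30.0)           # clamp to avoid overflow
        p = 1.0 / (1.0 + math.exp(-z))         # predicted P(y = 1 | x)
        g = p - y                              # gradient of log-loss w.r.t. z
        for i in range(dim):
            w[i] -= lr * g * x[i]              # one incremental step
        loss = -math.log(max(p if y == 1 else 1.0 - p, 1e-12))
        running = loss if running is None else 0.99 * running + 0.01 * loss
        if target_loss is not None and running < target_loss:
            break                              # model is "good enough": stop early
    return w

# Train on a synthetic stream: label is 1 exactly when the feature is positive.
random.seed(0)
stream = [([1.0, x], 1 if x > 0 else 0)
          for x in (random.uniform(-1, 1) for _ in range(2000))]
w = sgd_logistic(stream, dim=2)
```

Because the update only ever touches the current example and the weight vector, new data can be folded in at any time by calling the trainer again with the existing weights.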

On the other hand, this may all be academic, since it is very common to want to make many versions of a model with different variables or different target variables. This means that you need to keep a bunch of training data around in any case.

On Mon, Dec 6, 2010 at 5:32 AM, Frank Wang <> wrote:

> Hi,
> I'm working on a text classification problem. Given a piece of content, it
> will be classified into one or more categories.
> From my understanding, the Naive Bayes model is non-parametric, so every
> training run requires all of the accumulated sample data. However, if I
> were to use an SGD model with n binary logistic regressions, I wouldn't
> need to keep the historical sample data, which seems like it will lead to
> faster training in the long run.
> Is this fair logic?
