mahout-user mailing list archives

From Loic Descotte
Subject Re: SGD vs Naive Bayes for classification
Date Mon, 12 Sep 2011 07:54:46 GMT
Hi Zach and Ted,
Thanks a lot for your answers :)

So I will try to focus on SVM instead of SGD/Naive Bayes.
I'll also take a look at RapidMiner and Luduan.

Mahout in Action says that SVM has been added to Mahout as "an
experimental implementation".
Do you think it is usable in production anyway?



On 09.09.2011 19:54, Zach Richardson wrote:
> Hi Loic,
> In my experience, when dealing with smaller datasets (i.e. fewer than, let's
> say, 1000 training examples, or even fewer than 100 per category), a linear
> SVM tends to perform better than Mahout's SGD.
> I would recommend either RapidMiner, if you want a pretty GUI and some
> configurable text import tools, or liblinear/libsvm from the command line.
> The former will let you iterate quickly on what you are trying to do
> without any custom coding.  However, depending on how you want to deploy
> this, you might need to stick with liblinear/libsvm (RapidMiner uses the
> libsvm library internally) for the true "deployable" system, since the
> RapidMiner libraries are all AGPL.
> You can find examples for either online.  If you still are having problems,
> I would be more than happy to share a rapidminer pipeline for processing
> documents, training a classifier, etc.
> Zach
> On Fri, Sep 9, 2011 at 12:16 PM, Ted Dunning wrote:
>> On Fri, Sep 9, 2011 at 8:41 AM, Loic Descotte wrote:
>>> ... My goal is to make predictions on thousands of text entries, but with
>>> as little training data as possible (categories may change often, so I
>>> will not always have hundreds of training entries for each category).
>> This is very small with respect to Mahout algorithms.  There may be better
>> options.  The standard choice for small text datasets like this is linear
>> SVM, but SGD should work reasonably well.  Naive Bayes may not work as well
>> with such a small amount of training data.  I would avoid the adaptive SGD
>> and tune the training parameters by hand.
>>> Another question: in all the examples I've found, Naive Bayes is used to
>>> analyze sets containing a lot of keywords, and to classify them in the
>>> right category (e.g. Wikipedia examples:
>>> developerworks/java/library/j-mahout/#N10412).
>>> SGD examples are a little different: instead of working on word sequences,
>>> they use many predictor values, and each predictor has only one value for
>>> each entry.
>> That is true in Chapter 13 where SGD is introduced.  Later chapters
>> illustrate the use on the 20 newsgroups data.
>>> Is it possible to use the SGD algorithm (maybe better for me because I
>>> have small datasets) with text-only entries (like blog posts)?
>> Yes.  This should work fine.
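To make the "SGD on plain text" idea above concrete, here is a minimal sketch in plain Python. It is not Mahout's API or implementation: the function names, the hashed bag-of-words encoding, the toy sentences, and the hand-picked learning rate and regularization values are all assumptions chosen for illustration. It only shows that a text entry can be turned into a sparse vector of word counts and fed to a logistic regression trained by SGD with fixed, manually tuned parameters.

```python
import math
import re
import zlib

NUM_FEATURES = 2 ** 12  # size of the hashed feature space (assumed value)

def featurize(text, num_features=NUM_FEATURES):
    """Turn raw text into a sparse {index: count} bag of words via hashing."""
    features = {}
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        index = zlib.crc32(token.encode("utf-8")) % num_features
        features[index] = features.get(index, 0.0) + 1.0
    return features

def predict(weights, bias, features):
    """Return the estimated probability that the entry belongs to class 1."""
    score = bias + sum(weights[i] * v for i, v in features.items())
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, learning_rate=0.1, l2=1e-4, epochs=200,
          num_features=NUM_FEATURES):
    """Binary logistic regression trained by SGD with hand-set parameters."""
    weights = [0.0] * num_features
    bias = 0.0
    data = [(featurize(text, num_features), label) for text, label in examples]
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, bias, features) - label
            bias -= learning_rate * error
            # Only features present in this entry are updated (and regularized).
            for i, v in features.items():
                weights[i] -= learning_rate * (error * v + l2 * weights[i])
    return weights, bias

# Toy training set (hypothetical): label 1 = sports, label 0 = technology.
TRAIN = [
    ("the team won the football match", 1),
    ("a great goal in the second half of the match", 1),
    ("the striker scored and the team celebrated", 1),
    ("the new phone has a faster processor", 0),
    ("software update improves the laptop battery", 0),
    ("the processor and memory of this laptop are fast", 0),
]

if __name__ == "__main__":
    weights, bias = train(TRAIN)
    for text in ("the team scored a goal", "a laptop with a fast processor"):
        print(text, "->", round(predict(weights, bias, featurize(text)), 3))
```

With more than two categories, one such binary model per category (one-vs-rest) is a common way to extend the sketch. The learning rate, regularization strength, and number of passes are the parameters one would tune by hand on a small dataset.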
>> I would consider also the Luduan algorithm which is not currently part of
>> Mahout, although all the pieces are there.
>> The basic idea is that for each binary decision you have three kinds of
>> documents.  These are unjudged documents, judged relevant documents and
>> judged non-relevant.  Luduan uses a log-likelihood ratio test to compare the
>> judged relevant and judged non-relevant sets.  This comparison gives a set
>> of search terms that are used with standard retrieval weighting such as
>> tf-idf or BM25.  Term weights are determined by corpus frequencies without
>> any explicit reference to the frequencies in the judged relevant or
>> non-relevant documents.
>> For some classification tasks with modest sized training data, this method
>> out-performs most others.
>> I can send a PDF with a more detailed description.
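The term-selection step described above can be sketched roughly as follows. This is only an illustration of the idea as summarized in this thread, not the Luduan implementation and not Mahout code: the function names, the toy documents, the use of document frequencies in the 2x2 table, and the simple tf-idf variant are all assumptions. The log-likelihood ratio is the usual G² statistic for a 2x2 contingency table.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    k11: relevant docs containing the term,  k12: non-relevant docs containing it
    k21: relevant docs without the term,     k22: non-relevant docs without it
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # an empty cell contributes nothing
            g2 += observed * math.log(observed * total / (row * col))
    return 2.0 * g2

def select_terms(relevant, non_relevant, top_n=10):
    """Rank terms that are over-represented in the judged relevant documents."""
    n_rel, n_non = len(relevant), len(non_relevant)
    df_rel, df_non = {}, {}
    for docs, df in ((relevant, df_rel), (non_relevant, df_non)):
        for text in docs:
            for term in set(tokenize(text)):
                df[term] = df.get(term, 0) + 1
    scored = []
    for term, k11 in df_rel.items():
        k12 = df_non.get(term, 0)
        if k11 * n_non > k12 * n_rel:  # keep only terms more frequent in relevant docs
            scored.append((term, llr(k11, k12, n_rel - k11, n_non - k12)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

def tfidf_score(text, terms, corpus):
    """Score a document against the selected terms with a simple tf-idf weight.

    The idf comes from the whole corpus, not from the judged sets.
    """
    corpus_tokens = [set(tokenize(doc)) for doc in corpus]
    tokens = tokenize(text)
    score = 0.0
    for term in terms:
        df = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
        score += tokens.count(term) * idf
    return score

# Toy data (hypothetical).
RELEVANT = ["football match tonight", "the football team won", "football goal scored"]
NON_RELEVANT = ["laptop processor review", "the new phone", "phone battery update"]
UNJUDGED = ["football fans cheer", "phone sale today"]

if __name__ == "__main__":
    terms = [term for term, _ in select_terms(RELEVANT, NON_RELEVANT, top_n=3)]
    corpus = RELEVANT + NON_RELEVANT + UNJUDGED
    for doc in UNJUDGED:
        print(doc, "->", round(tfidf_score(doc, terms, corpus), 3))
```

The selected terms act as a query. Unjudged documents that score above some threshold would be assigned to the category, and that threshold is another parameter to choose by hand.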
>>> Thanks a lot for your time. Tell me if I'm not clear enough in my
>>> explanations :)
>> Please tell me the same.
