mahout-user mailing list archives

From Neal Richter <>
Subject Re: Text Classification using Mahout
Date Mon, 27 Sep 2010 19:51:17 GMT

  Is your classification task online or offline?  I.e., will you need a
classification for a piece of text live within a web service?


  I've put up a very easy to use implementation of NaiveBayes here:

  It's an extension of a Perl implementation from Dr. Dobb's Journal.
The article is also a good reference for people not already familiar
with the math.  I suggest you experiment with this before attempting
to scale up with Mahout+Hadoop.  Note that this implementation does
not do any TF-IDF normalization, as the Mahout ones do.

  The great thing about the above NaiveBayes is that 'training' the
model is a trivial extension of the "word count" job from Hadoop
101.  The output is "<word>, <label>, <count>".
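The word-count training described above can be sketched in plain Python (a stand-in for the Hadoop job; the function name and sample documents are illustrative, not from the linked implementation):

```python
# Sketch of Naive Bayes "training" as a word-count job: tally how often
# each word appears under each label. Each resulting entry corresponds
# to one "<word>, <label>, <count>" output record.
from collections import Counter

def train(labeled_docs):
    """labeled_docs: iterable of (text, label) pairs."""
    counts = Counter()
    for text, label in labeled_docs:
        for word in text.lower().split():
            counts[(word, label)] += 1
    return counts

docs = [("great product works fine", "pos"),
        ("complaint about a problem", "neg")]
model = train(docs)
# model[("great", "pos")] is the count behind the record "great, pos, 1"
```

In the real job the mapper would emit (word, label) keys and the reducer would sum the counts; this collapses both steps into one loop.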

  Obviously one should layer in TF-IDF for better accuracy once you
understand the basics.
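As a sketch of what "layering in TF-IDF" might mean here (the standard tf × log(N/df) weighting; the helper below is illustrative and not the Mahout implementation):

```python
# Re-weight raw term counts by inverse document frequency, so words that
# appear in every document contribute nothing and rarer words count more.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weighted
```

A word present in all documents gets weight log(N/N) = 0, which is exactly why TF-IDF helps: ubiquitous words stop dominating the label probabilities.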


 If your application requires online classification of text,
Mahout+Hadoop really only helps with the training phase... assuming your
software can't wait minutes for an answer from Hadoop.

 For quick-n-dirty text classification I've simply used Solr.

 1) Load your training examples as documents into Solr
      Simple approach is one document per label
 2) Search the index with the text you wish to classify
 3) Come up with some mechanism to use the Solr scores to make final decisions
 4) Full boosting syntax and fields from Solr are usable for more
structured classifications.
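Step 3 above is deliberately open-ended; one hypothetical mechanism (the margin value and scores below are invented for illustration, not taken from the system described) is a top-score-with-margin rule:

```python
# Given per-label Solr relevance scores, accept the top label only when it
# beats the runner-up by a margin; otherwise abstain (return None).
def classify(scores, margin=1.2):
    """scores: {label: relevance_score}; margin: required top/second ratio."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best_label, best), (_, second) = ranked[0], ranked[1]
    if second == 0 or best / second >= margin:
        return best_label
    return None
```

For example, `classify({"sports": 4.2, "politics": 1.1})` picks "sports", while a near-tie abstains; abstaining is often preferable to a coin-flip label in production.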

 A system I wrote to do this has been live in EC2 for almost 2 years,
doing about 50M classifications per day across 35+ topical labels.
About 20B usages so far; it works fine and is accurate enough for our
purposes.
  Here are some references on using TF-IDF for text classification:

Thanks - Neal

On Mon, Sep 27, 2010 at 11:53 AM, Neil Ghosh <> wrote:
> Hi Grant,
> Thanks so much for the reply on the mailing list. I have changed my
> problem to a slightly more common one.
> I have already gone through the tutorial you wrote on the IBM site. It was
> very good to start with. Thanks anyway.
> To be specific, my problem is to classify a piece of text crawled from the
> web into two classes:
> 1. It is a +ve feedback.
> 2. It is a -ve feedback.
> I can use the Twenty Newsgroups example and create a model with some text
> (maybe a large number of texts) by inputting the trainer with these two
> labels. Should I leave everything to the trainer completely like this?
> Or do I have the flexibility to give some other input specific to my
> problem? Such as: words like "Problem", "Complaint", etc. are more likely
> to appear in a text containing a grievance.
> Please let me know if you have any ideas or need more info from my side.
> Thanks
> Neil
> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <>wrote:
>> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
>> > Are there any other examples/documents/references on how to use Mahout
>> > for text classification?
>> > I went through and ran the following:
>> >
>> >   1. Wikipedia Bayes Example - Classify Wikipedia data.
>> >   2. Twenty Newsgroups - Classify the classic Twenty Newsgroups data.
>> >
>> > However, these two are not very definitive and there isn't much
>> > explanation for the examples. Please share if there is more
>> > documentation.
>> What kinds of problems are you looking to solve?  In general, we don't have
>> too much in the way of special things for text, other than various
>> utilities for converting text into Mahout's vector format based on various
>> weighting schemes.  Both of those examples just convert the text into
>> vectors and then either train or test on them.  I would agree, though,
>> that a good tutorial is needed.  It's a bit out of date in terms of the
>> actual commands, but I believe the concepts are still accurate:
>> See the creating vectors section; also see the Algorithms section.
>> --------------------------
>> Grant Ingersoll
>> Apache Lucene/Solr Conference, Boston Oct 7-8
> --
> Thanks and Regards
> Neil
