mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Text Classification using Mahout
Date Tue, 28 Sep 2010 11:05:14 GMT

On Sep 27, 2010, at 1:53 PM, Neil Ghosh wrote:

> Hi Grant,
>
> Thanks so much for responding. You can reply to this on the mailing list. I have
> changed my problem to a slightly more common one.
>
> I have already gone through the tutorial you wrote on the IBM site. It was a very
> good place to start. Thanks.
> To be specific, my problem is to classify a piece of text crawled from the web into
> two classes:
>
> 1. It is positive (+ve) feedback.
> 2. It is negative (-ve) feedback.
>
> I can use the two news group example and create a model from some text (maybe a
> large number of texts) by feeding the trainer these two labels. Should I leave
> everything to the trainer like this?
> 

Yes, that should be fine.  The trainer doesn't care about the names of the labels; it only
cares that the two sets are relatively independent.  Keep in mind that you should also set
aside some of your data for testing.
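
To make the point about held-out data concrete, here is a minimal sketch in plain Python (not Mahout's API) of a per-label train/test split. The function name `split_holdout`, the 20% test fraction, and the sample texts are assumptions made up for illustration.

```python
import random
from collections import defaultdict


def split_holdout(examples, test_fraction=0.2, seed=42):
    """Split (label, text) pairs into train and test lists, label by label.

    Splitting per label keeps both classes represented in the test set.
    The 0.2 default is an illustrative choice, not a recommendation.
    """
    by_label = defaultdict(list)
    for label, text in examples:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        texts = by_label[label][:]
        rng.shuffle(texts)
        # Hold out at least one document per label.
        n_test = max(1, round(len(texts) * test_fraction))
        test.extend((label, t) for t in texts[:n_test])
        train.extend((label, t) for t in texts[n_test:])
    return train, test


# Hypothetical toy data: 10 documents per class.
examples = [("positive", f"good review {i}") for i in range(10)]
examples += [("negative", f"bad review {i}") for i in range(10)]

train, test = split_holdout(examples)
print(len(train), "training documents,", len(test), "test documents")
```

The training portion would go to the trainer and the test portion only to the evaluation step, so the reported accuracy reflects documents the model has not seen.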

> Or do I have the flexibility to give some other input specific to my problem? For
> example, that words like "Problem", "Complaint", etc. are more likely to appear in a
> text containing a grievance.

You can provide a weight, usually TF-IDF, which often does a good job of factoring in the
importance of words.  If you have certain sentiment words that you think influence things one
way or the other, you could consider a weighting process that adds weight to those words, I
suppose, but I would want to experiment with that a bit.
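
As a rough illustration of that idea, the sketch below computes one common TF-IDF variant in plain Python and multiplies the weight of selected sentiment words by a boost factor. It is not Mahout's vectorizer: the `SENTIMENT_BOOST` lexicon, the factor 2.0, the tokenizer, and the sample documents are all assumptions chosen for the example, and whether boosting helps would need to be checked on held-out data.

```python
import math
from collections import Counter

# Hypothetical lexicon and boost factor, chosen only for illustration.
SENTIMENT_BOOST = {"problem": 2.0, "complaint": 2.0}


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip('.,!?"\'').lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs, boosts=None):
    """Return one {term: weight} dict per document.

    weight = term count * (ln(N / df) + 1) * optional per-term boost.
    This is one common TF-IDF variant; other tools may use a different formula.
    """
    boosts = boosts or {}
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            idf = math.log(n_docs / df[term]) + 1.0
            vec[term] = count * idf * boosts.get(term, 1.0)
        vectors.append(vec)
    return vectors


docs = [
    "I have a problem with my order",
    "Great service, no complaint at all",
    "The order arrived on time",
]
plain = tfidf_vectors(docs)
boosted = tfidf_vectors(docs, SENTIMENT_BOOST)
print(plain[0]["problem"], boosted[0]["problem"])
```

Comparing a classifier trained on `plain` weights against one trained on `boosted` weights, using the same test set, is one way to run the experiment suggested above.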

> 
> Please let me know if you have any ideas or need more info from my side.
> 
> Thanks
> Neil
> 
> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> 
> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
> 
> > Are there any other examples/documents/references on how to use Mahout for
> > text classification?
> >
> > I went through and ran the following:
> >
> >   1. Wikipedia Bayes Example
> >      <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html> -
> >      Classify Wikipedia data.
> >
> >   2. Twenty Newsgroups
> >      <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html> -
> >      Classify the classic Twenty Newsgroups data.
> >
> > However, these two are not very definitive, and there isn't much explanation
> > of the examples. Please share if there is more documentation.
> 
> 
> What kinds of problems are you looking to solve?  In general, we don't have too much
> in the way of special things for text, other than various utilities for converting
> text into Mahout's vector format based on various weighting schemes.  Both of those
> examples just convert the text into vectors and then either train or test on them.
> I would agree, though, that a good tutorial is needed.  This one is a bit out of date
> in terms of the actual commands, but I believe the concepts are still accurate:
> http://www.ibm.com/developerworks/java/library/j-mahout/
>
> See https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
> (and the creating vectors section).  Also see the Algorithms section.
> 
> 
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> 
> 
> 
> 
> -- 
> Thanks and Regards
> Neil
> http://neilghosh.com
> 
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

