mahout-user mailing list archives

From Drew Farris <d...@apache.org>
Subject Re: What are the ways to train and run classifiers on text?
Date Sun, 26 Sep 2010 20:05:30 GMT
Hi Bhaskar,

Take a look at the latest from svn trunk
(https://svn.apache.org/repos/asf/mahout/trunk/); you'll find the
TrainNewsGroups class in the examples project. It is all pretty new,
so there are no docs on the wiki, but the code is very readable.

If you are interested in working with the Bayes classifiers, take a
look at the classifier.bayes.* package in the examples project. The
PrepareTwentyNewsgroups example converts a bunch of files organized
into directories into the Bayes input format, iirc.
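
As a rough illustration of that conversion step, here is a minimal
standalone sketch. It is not the Mahout code and does not use any Mahout
API. It assumes one subdirectory per class label under a root directory,
and it emits the "<class-label> <tab> <words>" line format described in
the original question. The function names, the tokenizer regex, and the
choice to keep repeated words are all assumptions made for illustration;
check PrepareTwentyNewsgroups for what Mahout actually does.

```python
import re
from pathlib import Path

# Hypothetical tokenizer: runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def to_labeled_lines(root):
    """Yield one 'label<TAB>tokens' line per file under root/<label>/."""
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        for doc in sorted(label_dir.iterdir()):
            if not doc.is_file():
                continue
            text = doc.read_text(encoding="utf-8", errors="replace")
            tokens = tokenize(text)
            if tokens:  # skip files with no usable words
                yield label_dir.name + "\t" + " ".join(tokens)


def write_training_file(root, out_path):
    """Write all labeled lines for the tree under root to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in to_labeled_lines(root):
            out.write(line + "\n")
```

Calling write_training_file("20news-train", "train.txt") (both paths are
placeholders) would produce one line per document, with the directory
name as the label.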

Drew

On Sun, Sep 26, 2010 at 1:17 PM, Bhaskar Ghosh <bjgindia@yahoo.co.in> wrote:
> Thanks Ted. But I am unable to find the org.apache.mahout.classifier.sgd
> package. I could only locate the classifier.bayes.* packages.
>
>  Thanks
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>
>
> ________________________________
> From: Ted Dunning <ted.dunning@gmail.com>
> To: user@mahout.apache.org
> Sent: Sun, 26 September, 2010 9:40:17 AM
> Subject: Re: What are the ways to train and run classifiers on text?
>
> Take a look also at TrainNewsGroups in the classifier.sgd package in
> examples.
>
> That shows how to parse documents for use with an SGD classifier (different
> from NaiveBayes).
>
> There is much more format flexibility with an API oriented approach.
>
> On Sun, Sep 26, 2010 at 9:37 AM, Bhaskar Ghosh <bjgindia@yahoo.co.in> wrote:
>
>> Dear All,
>>
>> I need to classify a bunch of text files, i.e. determine which class
>> each one of these texts falls into.
>>
>>
>> Now I have looked through the 20Newsgroups example. I see that the input
>> text files need to have a particular format:
>>
>> <class-label> <tab> <unique features (words) associated with the
>> class-label>
>>
>>
>> But the real question is how do I get such a pre-processed input file? Do
>> I need to process the input text files to get them into the required
>> format? Then it would require extracting the unique words/features from
>> the raw text, in addition to assigning class-labels.
>>
>> OR
>>
>> Is there some classifier class that can take raw input files? My input
>> would be something like:
>>
>> <class-label1> <file1-text>
>> <class-label2> <file3-text>
>> <class-label1> <file2-text>
>> etc.
>>
>>
>> Thanks
>> Bhaskar Ghosh
>> Hyderabad, India
>>
>> http://www.google.com/profiles/bjgindia
>>
>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>
>>
>>
>
>
>
