mahout-user mailing list archives

From Bhaskar Ghosh <bjgin...@yahoo.co.in>
Subject Re: How to get multi-language support for training/classifying text into classes through Mahout?
Date Sun, 03 Oct 2010 04:58:48 GMT
Thanks a lot, Ted. I will try it.
 Regards
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Ted Dunning <ted.dunning@gmail.com>
To: user@mahout.apache.org
Sent: Sun, 3 October, 2010 10:20:52 AM
Subject: Re: How to get multi-language support for training/classifying text into classes through Mahout?

Hindi should be pretty good to go with the default Lucene analyzer.  You
should look at the tokens to be sure they are reasonable.  Punctuation and
some other word-breaking characters in Hindi may not be handled well, but
if the first five sentences work well, you should be OK.
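
One way to act on the advice to look at the tokens is to print them for a few sentences and check them by eye. The sketch below is not Lucene or Mahout code. It is a plain Python stand-in that uses only the standard `unicodedata` module, and it assumes that a token is a run of Unicode letters, combining marks, and digits. The names `tokenize` and `sample` are made up for illustration, and a real Lucene analyzer may split the same text differently.

```python
import unicodedata


def tokenize(text):
    """Split text into runs of Unicode letters, combining marks, and digits."""
    tokens, current = [], []
    for ch in text:
        # General categories starting with L (letter), M (mark) or N (number)
        # are treated as part of a word. Hindi vowel signs and the virama are
        # marks, so a tokenizer that keeps only letters would cut words apart.
        # Assumption: zero-width joiners (category Cf) are not handled here
        # and would end a token.
        if unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Hypothetical sample: two short Hindi sentences, the first ending in a danda.
sample = "मैं हिंदी सीख रहा हूँ। यह अच्छा है!"
for token in tokenize(sample):
    print(token)
```

If the printed tokens are whole words, with the danda and other punctuation dropped and no vowel signs split off, the tokenization is probably reasonable. The same check applies to whatever analyzer is actually used for training.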

On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <bjgindia@yahoo.co.in> wrote:

> Hi Ted,
>
> I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> that "Classifier supports non-English tokens (it assumes the string is
> UTF-8 encoded)". Does that mean that the Classifier would just tokenize
> based on the Unicode encoding, so that we do not need to worry about the
> language? Or do we need to configure something?
>
> I do not know which factors make a language harder to tokenize. But I
> have learnt from earlier conversations on this mailing list that
> languages in which a word is written as a sequence of several words are
> hard to tokenize. In that sense, I assume that words in Hindi are single
> words.
>
>  Thanks
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>
>
> ________________________________
> From: Ted Dunning <ted.dunning@gmail.com>
> To: user@mahout.apache.org
> Sent: Sun, 3 October, 2010 12:53:37 AM
> Subject: Re: How to get multi-language support for training/classifying text into classes through Mahout?
>
> You will need to make sure that the tokenization is done reasonably.
>
> There is an example program for a sequential classifier in
> org.apache.mahout.classifier.sgd.TrainNewsGroups
>
> It assumes data in the 20 news groups format and uses a Lucene tokenizer.
>
> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> the command line.
>
> Can you say which languages?  Are they easy to tokenize (like French)?  Or
> medium (like German/Turkish)?
> Or hard (like Chinese/Japanese)?
>
> Can you say how much data?
>
> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <bjgindia@yahoo.co.in> wrote:
>
> > Dear All,
> >
> > I have a requirement where I need to classify text in a non-English
> > language. I have heard that Mahout supports multiple languages. Can
> > anyone please tell me how to achieve this? Some documents or links
> > with examples on this would be really helpful.
> >  Regards
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> >
>
>
>


