nutch-dev mailing list archives

From "Britz, Thibaut" <>
Subject RE: International Parser
Date Fri, 25 Mar 2005 17:14:52 GMT

You could use sun.text.Normalizer to decompose accented characters and strip
the accents. Maybe you should also check first what language the text is
written in before applying the conversion.
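(Note that sun.text.Normalizer is an internal Sun API; java.text.Normalizer,
added in Java 6, is the supported equivalent. A minimal sketch of the
accent-stripping idea, assuming that class:)

```java
import java.text.Normalizer;

public class AccentFolder {
    /** Decompose to NFD, then strip combining marks, so "é" and "è" become "e". */
    public static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("été"));   // ete
        System.out.println(fold("où è")); // ou e
    }
}
```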

-----Original Message-----
From: Doug Cutting []
Sent: Thu 3/24/2005 9:08 PM
Subject: Re: International Parser
Christophe Noel wrote:
> I need to develop a "French" parser. Google indexes French documents,
> folding "é" (HTML: &eacute;) and "è" characters to "e". I think there
> is already a French parser for Lucene, so this is not really a problem.

Would it be a problem to simply make this conversion for all languages? 
  Does Google distinguish between "é", "è" and "e" for other languages?

> Problem is : can it be created as a nutch plugin ?

It is a little complicated to add language-specific tokenization, since 
Nutch's tokenizer is currently defined together with its query parser, 
and each plugin should not have to re-write the query parser, as it is 
rather complex.

A good way to handle this might be to rewrite the query parser so that 
it uses a language-specific tokenizer as input.  Each plugin would 
define a tokenizer.  Plugins would be selected by language, with a 
configuration-defined default.  Most implementations would probably 
simply apply a token filter to the output of a standard tokenizer 
implementation.  The tokenizer must always split tokens at query syntax 
characters.  The query parser must then declare a list of query syntax 
characters.

Each plugin should also define a stop list.  In Nutch, stop lists are 
not used at index time, but rather only applied by the query parser to 
terms that are not either in a phrase or explicitly required.
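(The query-time rule above can be illustrated with a small sketch; the
names here are hypothetical, not the real Nutch query-parser code:)

```java
import java.util.Collections;
import java.util.Set;

public class StopPolicy {
    /** A stop word is dropped only when the term is neither inside a
     *  quoted phrase nor explicitly required; otherwise it is kept. */
    public static boolean keepTerm(String term, boolean inPhrase,
                                   boolean required, Set<String> stopList) {
        return inPhrase || required || !stopList.contains(term);
    }

    public static void main(String[] args) {
        Set<String> stop = Collections.singleton("the");
        System.out.println(keepTerm("the", false, false, stop)); // false: dropped
        System.out.println(keepTerm("the", true,  false, stop)); // true: kept in phrase
        System.out.println(keepTerm("cat", false, false, stop)); // true: not a stop word
    }
}
```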

So the API might look something like:

/** Factory to get plugin implementation. */
public class LanguageAnalyzerFactory {
   public static LanguageAnalyzer getAnalyzer(String language);
}

/** Implemented by plugins. */
public interface LanguageAnalyzer {
   TokenStream getTokenStream(Reader reader);
   boolean isStopWord(String term);
}

/** A default implementation.  Most LanguageAnalyzer plugins will apply 
a filter to this. */
public class NutchTokenizer implements TokenStream {
   // returns the same strings as existing NutchAnalysis.term()
   public Token next();
}

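A toy plugin under that scheme might look like the following. The interfaces
are minimal stand-ins so the sketch is self-contained; the names follow the
proposal above, but everything here is illustrative, not real Nutch code:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal stand-ins for the sketched interfaces (hypothetical).
interface TokenStream {
    String next();  // returns null when the stream is exhausted
}

interface LanguageAnalyzer {
    TokenStream getTokenStream(Reader reader);
    boolean isStopWord(String term);
}

/** Hypothetical French plugin: lowercases and accent-folds each token and
 *  carries its own stop list.  A real plugin would filter NutchTokenizer. */
class FrenchAnalyzer implements LanguageAnalyzer {
    private static final Set<String> STOP =
        new HashSet<>(Arrays.asList("le", "la", "les", "de", "du", "et", "un", "une"));

    public TokenStream getTokenStream(final Reader reader) {
        return new TokenStream() {
            private String[] tokens;
            private int pos;

            public String next() {
                if (tokens == null) {
                    tokens = readAll(reader).toLowerCase().split("\\s+");
                }
                while (pos < tokens.length) {
                    String t = fold(tokens[pos++]);
                    if (!t.isEmpty()) return t;
                }
                return null;
            }
        };
    }

    public boolean isStopWord(String term) {
        return STOP.contains(fold(term.toLowerCase()));
    }

    /** Decompose (NFD) and strip combining marks, so "é" and "è" become "e". */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    private static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```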
Does this sound like the best approach?  Is anyone willing to try to 
implement this?  It requires JavaCC hacking...

