lucene-java-user mailing list archives

From Paul Hill
Subject RE: Is StandardAnalyzer good enough for multi languages...
Date Tue, 08 Jan 2013 23:54:58 GMT
The ICU project has Analyzers for Lucene, and they have been ported
to ElasticSearch.  Maybe those integrate better.

As to not doing some tokenization, I would think an extra tokenizer in your chain would be
just the thing.
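
To get a feel for what rule-based word breaking does with a given input, here is a
minimal sketch that uses only the JDK's java.text.BreakIterator. It is not the Lucene
StandardAnalyzer and not the ICU analyzers, so its output can differ from either; the
class name WordBoundarySketch and the helper words() are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class: JDK-only stand-in, NOT Lucene or ICU code.
public class WordBoundarySketch {

    // Split text at the JDK's word boundaries and keep only the pieces
    // that contain at least one letter or digit (drops spaces, punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Pass your own problem strings as arguments to see where breaks fall.
        String[] samples = args.length > 0
                ? args
                : new String[] {"Hello, world", "user@example.com file_name.txt"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + words(sample));
        }
    }
}
```

Running it on strings from your own content shows quickly which inputs stay glued
together, which is a cheap way to decide whether an extra step in the chain is needed.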


> -----Original Message-----
> From: Trejkaz
> Sent: Tuesday, January 08, 2013 3:44 PM
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi wrote:
> > Does Lucene StandardAnalyzer work for all the languages for tokenizing
> > before indexing (since we are using Java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
> No. There are multiple cases where it chooses not to break something which it
> should break. Some of these cases even result in undesirable behaviour for
> English, so I would be surprised if there were even a single language which it
> handles acceptably.
