The ICU project (http://site.icu-project.org/) has analyzers for Lucene, and they have been
ported to Elasticsearch. Maybe those integrate better.
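For example, with the lucene-analyzers-icu module on the classpath, the chain might look
roughly like this (an untested sketch against the Lucene 4.x API; the IcuAnalyzer name is
just mine, but ICUTokenizer and ICUFoldingFilter ship with the module):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    // Segments text with ICU's per-script UAX#29 breaking, then applies
    // ICU case/accent folding.
    public final class IcuAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new ICUTokenizer(reader);        // script-aware segmentation
            TokenStream result = new ICUFoldingFilter(source);  // normalization + folding
            return new TokenStreamComponents(source, result);
        }
    }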
As for not doing some of that tokenization, I would think an extra tokenizer or token
filter in your analysis chain would be just the thing.
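A quick way to check what any chain actually does to a given string, so you can compare
StandardAnalyzer against an ICU chain (again a sketch; dump() is just a helper name I made
up, but the TokenStream calls are the stock Lucene 4.x API):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class PrintTokens {
        // Prints each token the analyzer emits for the given text.
        static void dump(Analyzer analyzer, String text) throws IOException {
            TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }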
-Paul
> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
>
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantoshi76@gmail.com> wrote:
> > Does Lucene StandardAnalyzer work for all languages for tokenizing
> > before indexing (since we are using Java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
>
> No. There are multiple cases where it chooses not to break something which it
> should break. Some of these cases even result in undesirable behaviour for
> English, so I would be surprised if there were even a single language which it
> handles acceptably.