lucene-java-user mailing list archives

From Steve Rowe <>
Subject Re: Is StandardAnalyzer good enough for multi languages...
Date Wed, 09 Jan 2013 06:25:44 GMT
Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) specializations
by default for Thai, Lao, Myanmar and Hebrew.  See DefaultICUTokenizerConfig.  It's filled
with exactly the opposite of what you were describing. 

ICUTokenizerFactory's customizability has been enhanced in to-be-released Lucene/Solr 4.1:
<> - you can provide per-script RuleBasedBreakIterator
specification files at runtime. 
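
To make "per-script specialization" concrete, here is a rough, JDK-only sketch of the general idea: cut the text into runs of a single script, then pick a break iterator per run. This is not ICUTokenizer's code and uses no Lucene or ICU4J classes. The class name PerScriptSketch, the OVERRIDES map and the single Thai entry are hypothetical, and java.text.BreakIterator merely stands in for the per-script RuleBasedBreakIterator rules.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PerScriptSketch {
    // Hypothetical per-script overrides; any script not listed uses the root-locale default.
    static final Map<UnicodeScript, Locale> OVERRIDES = new EnumMap<>(UnicodeScript.class);
    static {
        OVERRIDES.put(UnicodeScript.THAI, Locale.forLanguageTag("th"));
    }

    // A maximal stretch of text that belongs to one script.
    static final class Run {
        final UnicodeScript script;
        final String text;
        Run(UnicodeScript script, String text) { this.script = script; this.text = text; }
    }

    // Spaces, punctuation and similar code points carry no script of their own.
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED || s == UnicodeScript.UNKNOWN;
    }

    // Splits text into single-script runs; neutral code points stay with the current run.
    static List<Run> scriptRuns(String text) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null;
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s)) {
                if (current != null && s != current) {
                    runs.add(new Run(current, text.substring(start, i)));
                    start = i;
                }
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current, text.substring(start)));
        }
        return runs;
    }

    // Tokenizes each run with the break iterator chosen for its script,
    // keeping only the pieces that start with a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (Run run : scriptRuns(text)) {
            Locale locale = OVERRIDES.getOrDefault(run.script, Locale.ROOT);
            BreakIterator words = BreakIterator.getWordInstance(locale);
            words.setText(run.text);
            int from = words.first();
            for (int to = words.next(); to != BreakIterator.DONE; from = to, to = words.next()) {
                String piece = run.text.substring(from, to);
                if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Latin, Thai and Latin again; the Thai word is written with escapes to avoid encoding issues.
        String mixed = "Hello \u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 world";
        for (Run run : scriptRuns(mixed)) {
            System.out.println(run.script + " -> [" + run.text + "]");
        }
        System.out.println(tokens(mixed));
    }
}
```

How the Thai run gets segmented here depends on the locale data shipped with your JDK, so treat that part as illustrative only. The real tokenizer gets its per-script behaviour from ICU rule files, not from java.text.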

On Jan 9, 2013, at 12:12 AM, Trejkaz <> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest
>> to you, along with the token filters in that same module. - Steve
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.
