lucene-java-user mailing list archives

Subject StandardTokenizer and Unicode
Date Mon, 19 Aug 2002 16:45:08 GMT

Hi all,

Has anyone had any luck using StandardTokenizer for
Unicode beyond the Latin-1 set? I have tried to use it
for Cyrillic (U+0400..U+04FF), and it looks like the
characters don't get through, even though Cyrillic IS
included in StandardTokenizer.jj (i.e. it is among the
Unicode ranges used to define the Letter token). If I
set UNICODE_INPUT = true in StandardTokenizer.jj (and
disable USER_CHAR_STREAM = true), it starts working
perfectly.
So does that mean I have to maintain my own version of
StandardTokenizer to make Unicode input possible?
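
For reference, the change described above would look roughly
like this in the options section of StandardTokenizer.jj.
This is only a sketch: UNICODE_INPUT and USER_CHAR_STREAM
are the two options named in this message, and any other
options in the real file are assumed to stay unchanged and
are not shown.

```
options {
  // Other options from the original file are omitted here (assumption).

  // Generate a token manager that accepts full 16-bit Unicode input.
  UNICODE_INPUT = true;

  // Disabled, so JavaCC generates its own character stream class
  // instead of expecting a user-supplied CharStream implementation.
  // USER_CHAR_STREAM = true;
}
```

The grammar then has to be run through JavaCC again so that
the generated tokenizer sources pick up the new options.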

Boris Okner 
