lucene-dev mailing list archives

From "Steven A Rowe"
Subject RE: Fullwidth alphanumeric characters, plus a question on Korean ranges
Date Mon, 07 Jan 2008 18:17:28 GMT
Hi Daniel,

I think this discussion belongs on java-dev, so I'm replying there.

On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> We discovered [in StandardTokenizer.jj] that fullwidth letters are
> not treated as <LETTER> and fullwidth digits are not treated as <DIGIT>.

IMHO, this should be fixed in the JFlex version of StandardTokenizer - do you have details?
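
As a quick sanity check outside the grammar, java.lang.Character already classifies the
fullwidth forms as letters and digits. A minimal sketch: the class name FullwidthCheck and
the two sample code points are chosen here for illustration, and it says nothing about how
the JFlex grammar itself defines <LETTER> or <DIGIT>.

```java
class FullwidthCheck {
    // Sample code points picked for illustration only.
    static final int FULLWIDTH_A = 0xFF21;   // FULLWIDTH LATIN CAPITAL LETTER A
    static final int FULLWIDTH_1 = 0xFF11;   // FULLWIDTH DIGIT ONE

    /** True if the JDK's Unicode tables classify the code point as a letter. */
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /** True if the JDK's Unicode tables classify the code point as a decimal digit. */
    static boolean isDigit(int codePoint) {
        return Character.isDigit(codePoint);
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X letter? %b%n", FULLWIDTH_A, isLetter(FULLWIDTH_A));
        System.out.printf("U+%04X digit?  %b%n", FULLWIDTH_1, isDigit(FULLWIDTH_1));
    }
}
```

If the tokenizer is meant to follow the same classification, both code points would be
expected to end up in <LETTER> and <DIGIT> respectively.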

Concerning handling of Korean characters, some recent StandardTokenizer.jj history:

StandardTokenizer loses Korean characters

StandardTokenizer splitting all of Korean words into separate characters

CJK char list

> [W]hile sanity checking the blocks in StandardTokenizer.jj I found
> some suspicious parts and felt it necessary to check that this is by
> design as there is no comment explaining the anomalies.
> Line 87:
>        "\uffa0"-"\uffdc"
>   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
>   as expected, so I'm wondering if these halfwidth Hangul "letters"
>   should actually be in <KOREAN> instead of <LETTER>.

[U+FFA0-U+FFDC] contains halfwidth Hangul Jamo (individual phonetic letters), not precomposed
Hangul syllables.
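
The difference between Jamo and precomposed syllables shows up in the Unicode block each
code point belongs to. A minimal sketch using Character.UnicodeBlock from the JDK; the class
name JamoBlocks and the sample code points are illustrative choices, not part of the grammar.

```java
class JamoBlocks {
    /** Looks up the Unicode block the JDK assigns to a code point. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // First code point of each range discussed, plus one precomposed syllable.
        int[] samples = { 0xFFA0, 0x1100, 0x3130, 0xAC00 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, blockOf(cp));
        }
    }
}
```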

The patch for LUCENE-478 added this range to the <LETTER> definition to be consistent with
the inclusion of their full-width versions ([U+1100-U+11FF]), which have been in the <LETTER>
definition since time immemorial.


However, I just noticed that [U+1100-U+11FF] is included both in the <LETTER> and <KOREAN>
sections - not good.  I think [U+1100-U+11FF] should be removed from the <LETTER> definition,
and left as-is in the <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
to <KOREAN>.
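
A range listed in two token classes is easy to detect mechanically. A minimal sketch,
assuming each definition is reduced to a list of inclusive code point ranges: the arrays
below hold only the ranges named in this message, not the full grammar, and the class and
method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RangeOverlap {
    // Hypothetical excerpts: only the ranges mentioned in this message.
    static final int[][] LETTER = { {0x1100, 0x11FF}, {0xFFA0, 0xFFDC} };
    static final int[][] KOREAN = { {0x1100, 0x11FF} };

    /** Returns every overlap between the two range lists, formatted as "U+XXXX-U+YYYY". */
    static List<String> overlaps(int[][] a, int[][] b) {
        List<String> result = new ArrayList<>();
        for (int[] x : a) {
            for (int[] y : b) {
                int start = Math.max(x[0], y[0]);
                int end = Math.min(x[1], y[1]);
                if (start <= end) {          // the inclusive ranges intersect
                    result.add(String.format("U+%04X-U+%04X", start, end));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("LETTER/KOREAN overlap: " + overlaps(LETTER, KOREAN));
    }
}
```

Running the same check after the proposed change (both ranges only in <KOREAN>) should
report no overlap.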

> Line 92:
>        "\u3040"-"\u318f",
>   This block appears to duplicate the ranges in the next three lines and
>   suspiciously also includes a range which belongs to <KOREAN>, making
>   me wonder what happens when a range is in two blocks.

Otis Gospodnetic split this range into separately commented ranges, and must have forgotten
to remove the original range on line 92.


Here are the ranges in question:

   [U+3040-U+309F] - Japanese Hiragana
   [U+30A0-U+30FF] - Japanese Katakana
   [U+3100-U+312F] - Chinese Bopomofo
   [U+3130-U+318F] - Korean Hangul Compatibility Jamo

I agree with your assessment: the range on line 92 should be removed. Apart from the Hangul
Compatibility Jamo range, which should be moved to the <KOREAN> section, [U+3040-U+318F] is
already covered by the Hiragana, Katakana, and Bopomofo ranges in the <CJ> section.
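
The claim that the four ranges listed above account for all of [U+3040-U+318F] can be
checked directly. A minimal sketch, assuming the sub-ranges are given in ascending order;
the class and method names are hypothetical.

```java
class Line92Coverage {
    // The four ranges listed above, as inclusive code point pairs in ascending order.
    static final int[][] NAMED_RANGES = {
        {0x3040, 0x309F},   // Japanese Hiragana
        {0x30A0, 0x30FF},   // Japanese Katakana
        {0x3100, 0x312F},   // Chinese Bopomofo
        {0x3130, 0x318F},   // Korean Hangul Compatibility Jamo
    };

    /** True if the sorted parts cover [start, end] exactly, with no gaps or overlaps. */
    static boolean tilesExactly(int start, int end, int[][] parts) {
        int next = start;
        for (int[] part : parts) {
            if (part[0] != next) {
                return false;            // gap or overlap before this part
            }
            next = part[1] + 1;
        }
        return next == end + 1;
    }

    public static void main(String[] args) {
        System.out.println("line 92 range fully covered: "
                + tilesExactly(0x3040, 0x318F, NAMED_RANGES));
    }
}
```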

Of course, since the JavaCC grammar is no longer in Lucene-Java trunk, these modifications
should be made in StandardTokenizerImpl.jflex.

