lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Noll <>
Subject Fullwidth alphanumeric characters, plus a question on Korean ranges
Date Mon, 07 Jan 2008 00:47:22 GMT
Hi all.

We discovered that fullwidth letters are not treated as <LETTER> and fullwidth 
digits are not treated as <DIGIT>.

This in itself is probably easy to fix (including the filter for normalising 
these back to the normal versions) but while sanity checking the blocks in 
StandardTokenizer.jj I found some suspicious parts and felt it necessary to 
check that this is by design as there is no comment explaining the anomalies.

Line 87:

  The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ> as
  expected, so I'm wondering if these halfwidth Hangul "letters" should
  actually be in <KOREAN> instead of <LETTER>.

Line 92:

  This block appears to duplicate the ranges in the next three lines and
  suspiciously also includes a range which belongs to <KOREAN>, making me
  wonder what happens when a range is in two blocks.

In case anyone is wondering, the JFlex version of the tokeniser on Lucene 
trunk has the same ranges.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message