lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Noll <dan...@nuix.com>
Subject Fullwidth alphanumeric characters, plus a question on Korean ranges
Date Mon, 07 Jan 2008 00:47:22 GMT
Hi all.

We discovered that fullwidth letters are not treated as <LETTER> and fullwidth 
digits are not treated as <DIGIT>.

This in itself is probably easy to fix (including the filter for normalising 
these back to the normal versions) but while sanity checking the blocks in 
StandardTokenizer.jj I found some suspicious parts and felt it necessary to 
check that this is by design as there is no comment explaining the anomalies.

Line 87:
       "\uffa0"-"\uffdc"

  The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ> as
  expected, so I'm wondering if these halfwidth Hangul "letters" should
  actually be in <KOREAN> instead of <LETTER>.

Line 92:
       "\u3040"-"\u318f",

  This block appears to duplicate the ranges in the next three lines and
  suspiciously also includes a range which belongs to <KOREAN>, making me
  wonder what happens when a range is in two blocks.

In case anyone is wondering, the JFlex version of the tokeniser on Lucene 
trunk has the same ranges.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message