lucene-dev mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Fullwidth alphanumeric characters, plus a question on Korean ranges
Date Mon, 07 Jan 2008 18:17:28 GMT
Hi Daniel,

I think this discussion belongs on java-dev, so I'm replying there.

On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> We discovered [in StandardTokenizer.jj] that fullwidth letters are
> not treated as <LETTER> and fullwidth digits are not treated as <DIGIT>.

IMHO, this should be fixed in the JFlex version of StandardTokenizer - do you have details?
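
For what it's worth, the JDK already classifies these code points as letters and digits, so adding them would just bring the grammar in line with the Unicode general categories.  A quick sanity check (my own throwaway snippet, nothing from the tokenizer):

  // Throwaway check: how the JDK classifies fullwidth alphanumerics.
  // U+FF21 = FULLWIDTH LATIN CAPITAL LETTER A, U+FF10 = FULLWIDTH DIGIT ZERO.
  public class FullwidthCheck {
    public static void main(String[] args) {
      System.out.println(Character.isLetter('\uFF21'));  // true (general category Lu)
      System.out.println(Character.isDigit('\uFF10'));   // true (general category Nd)
    }
  }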

Concerning handling of Korean characters, some recent StandardTokenizer.jj history:

StandardTokenizer loses Korean characters
   http://issues.apache.org/jira/browse/LUCENE-444

StandardTokenizer splitting all of Korean words into separate characters
   http://issues.apache.org/jira/browse/LUCENE-461

CJK char list
   http://issues.apache.org/jira/browse/LUCENE-478

> [W]hile sanity checking the blocks in StandardTokenizer.jj I found
> some suspicious parts and felt it necessary to check that this is by
> design as there is no comment explaining the anomalies.
> 
> Line 87:
>        "\uffa0"-"\uffdc"
> 
>   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
>   as expected, so I'm wondering if these halfwidth Hangul "letters"
>   should actually be in <KOREAN> instead of <LETTER>.

[U+FFA0-U+FFDC] is Hangul Jamo (phonetic symbols), not precomposed Hangul syllables.

The patch for LUCENE-478 modified the <LETTER> definition to include this range, to be consistent
with the inclusion of their fullwidth versions ([U+1100-U+11FF]) in the <LETTER> definition,
where they have been since time immemorial:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?revision=149570&view=markup&pathrev=149570>

However, I just noticed that [U+1100-U+11FF] is included in both the <LETTER> and <KOREAN>
definitions - not good.  I think [U+1100-U+11FF] should be removed from the <LETTER> definition
and left as-is in the <KOREAN> definition, and [U+FFA0-U+FFDC] should be moved from <LETTER>
to <KOREAN>.
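
To illustrate which Unicode blocks are involved, here's another throwaway snippet of my own, using java.lang.Character.UnicodeBlock (again, nothing from the tokenizer):

  // Throwaway check of which Unicode blocks these ranges live in.
  public class HangulBlockCheck {
    public static void main(String[] args) {
      // Conjoining jamo, currently in both <LETTER> and <KOREAN>
      System.out.println(Character.UnicodeBlock.of('\u1100'));  // HANGUL_JAMO
      // Halfwidth Hangul letters, currently in <LETTER>
      System.out.println(Character.UnicodeBlock.of('\uFFA0'));  // HALFWIDTH_AND_FULLWIDTH_FORMS
      // Precomposed syllables, for comparison
      System.out.println(Character.UnicodeBlock.of('\uAC00'));  // HANGUL_SYLLABLES
    }
  }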

> Line 92:
>        "\u3040"-"\u318f",
> 
>   This block appears to duplicate the ranges in the next three lines and
>   suspiciously also includes a range which belongs to <KOREAN>, making
>   me wonder what happens when a range is in two blocks.

Otis Gospodnetic expanded this range into the specific sub-ranges, with a comment on each, and
must have forgotten to remove the original range on line 92:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=431151&r2=431152&pathrev=431152&diff_format=h>

Here are the ranges in question:

   [U+3040-U+309F] - Japanese Hiragana
   [U+30A0-U+30FF] - Japanese Katakana
   [U+3100-U+312F] - Chinese Bopomofo
   [U+3130-U+318F] - Korean Hangul Compatibility Jamo

I agree with your assessment - the range on line 92 should be removed: apart from the Hangul
Compatibility Jamo range, which should be moved to the <KOREAN> section, [U+3040-U+318F] is
already covered by the Hiragana, Katakana, and Bopomofo ranges in the <CJ> section.
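
As a cross-check on those boundaries, one more throwaway snippet against the JDK's Unicode block data (my own sketch, not anything from the grammar):

  // Throwaway check of the block boundaries listed above.
  public class CjBlockCheck {
    public static void main(String[] args) {
      System.out.println(Character.UnicodeBlock.of('\u3040'));  // HIRAGANA
      System.out.println(Character.UnicodeBlock.of('\u30A0'));  // KATAKANA
      System.out.println(Character.UnicodeBlock.of('\u3100'));  // BOPOMOFO
      System.out.println(Character.UnicodeBlock.of('\u3130'));  // HANGUL_COMPATIBILITY_JAMO
      System.out.println(Character.UnicodeBlock.of('\u318F'));  // HANGUL_COMPATIBILITY_JAMO
    }
  }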

Of course, since the JavaCC grammar is no longer in Lucene-Java trunk, these modifications
should be made in StandardTokenizerImpl.jflex.

Steve


