From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Fullwidth alphanumeric characters, plus a question on Korean ranges
Date Thu, 10 Jan 2008 19:54:30 GMT
Hi Daniel,

On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...

Well, a quick perusal of the JFlex docs indicates that just such a facility is available.
From <http://jflex.de/manual.html#SECTION00053000000000000000> (edited for brevity:
'...' indicates elided material):

    -----
    Lexical Rules : Syntax
    ...
       RegExp       ::= RegExp '|' RegExp | ... | PredefinedClass | ...
       PredefinedClass ::= ... | '[:letter:]' | '[:digit:]' | ...
    ...
    Lexical Rules : Semantics
    ...
       [:letter:]       isLetter()
       [:digit:]        isDigit()
    -----

The DIGIT macro could be replaced by the predefined character class [:digit:].

Although isLetter() (and so also [:letter:]) includes CJK characters, there is a way to handle
this - from Lexical Rules : Semantics (<http://jflex.de/manual.html#SECTION00053000000000000000>):

    -----
    !a
        (negation)

        matches everything but the strings matched by a. Use with care:
        the construction of !a involves an additional, possibly exponential
        NFA to DFA transformation on the NFA for a. Note that with negation
        and union you also have (by applying DeMorgan) intersection and set
        difference: the intersection of a and b is !(!a|!b), the expression
        that matches everything of a not matched by b is !(!a|b) 
    -----

Using the /!(!a|b)/ syntax to exclude CJ characters from the LETTER macro:

    LETTER = ! ( ! [:letter:] | {CJ} )
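
To make the De Morgan step concrete, here is a minimal Java sketch of the same set difference at the single-character level. It is not JFlex code and not the tokenizer's implementation; the class name, the difference() helper, and the CJ ranges are assumptions for illustration (the ranges are a small subset, not the grammar's actual {CJ} macro).

```java
import java.util.function.IntPredicate;

// Illustration only: models !(!a|b) as "a and not b" on single code points.
final class LetterMinusCj {

    // ASSUMPTION: an illustrative subset of CJ ranges, not the real {CJ} macro.
    static final IntPredicate CJ = cp ->
            (cp >= 0x3040 && cp <= 0x30FF)     // Hiragana and Katakana blocks
         || (cp >= 0x4E00 && cp <= 0x9FFF);    // CJK Unified Ideographs block

    // Stand-in for JFlex's [:letter:], which the manual maps to isLetter().
    static final IntPredicate LETTER_CLASS = Character::isLetter;

    // Set difference written exactly as the manual's formula: !(!a | b).
    static IntPredicate difference(IntPredicate a, IntPredicate b) {
        return a.negate().or(b).negate();
    }

    // Counterpart of: LETTER = ! ( ! [:letter:] | {CJ} )
    static final IntPredicate LETTER = difference(LETTER_CLASS, CJ);

    public static void main(String[] args) {
        int[] samples = { 'a', '1', 0x4E2D, 0x30A2, 0xAC00 };
        for (int cp : samples) {
            System.out.printf("U+%04X letter=%b cj=%b LETTER=%b%n",
                    cp, LETTER_CLASS.test(cp), CJ.test(cp), LETTER.test(cp));
        }
    }
}
```

With these assumed ranges, a Latin letter or a Hangul syllable passes LETTER, while an ideograph or a Katakana letter is a letter per isLetter() but gets subtracted out.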
 
> On Tuesday 08 January 2008 05:17:28 Steven A Rowe wrote:
> > On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> > > We discovered [in StandardTokenizer.jj] that fullwidth letters are
> > > not treated as <LETTER> and fullwidth digits are not
> > > treated as <DIGIT>.
> > 
> > IMHO, this should be fixed in the JFlex version of StandardTokenizer -
> > do you have details?
> 
> The following ranges are relevant here:
> 
>   FF10-FF19  Fullwidth digits
>   FF21-FF3A  Fullwidth Latin uppercase
>   FF41-FF5A  Fullwidth Latin lowercase

Note that these are properly covered by [:digit:] and [:letter:].
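
This is easy to check against the JRE directly. A small sketch, assuming a Java 5 or later runtime (it uses the int code point overloads of Character); the class and method names are made up for the example.

```java
// Sketch only: verifies how the JRE classifies the three fullwidth ranges.
// Uses Character.isDigit(int) / Character.isLetter(int), available since Java 5.
final class FullwidthCheck {

    /** Returns true if Character.isDigit holds for every code point in [from, to]. */
    static boolean allDigits(int from, int to) {
        for (int cp = from; cp <= to; cp++) {
            if (!Character.isDigit(cp)) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if Character.isLetter holds for every code point in [from, to]. */
    static boolean allLetters(int from, int to) {
        for (int cp = from; cp <= to; cp++) {
            if (!Character.isLetter(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("FF10-FF19 all digits:  " + allDigits(0xFF10, 0xFF19));
        System.out.println("FF21-FF3A all letters: " + allLetters(0xFF21, 0xFF3A));
        System.out.println("FF41-FF5A all letters: " + allLetters(0xFF41, 0xFF5A));
    }
}
```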

> > > Line 87:
> > >        "\uffa0"-"\uffdc"
> > > 
> > >   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> > >   as expected, so I'm wondering if these halfwidth Hangul "letters"
> > >   should actually be in <KOREAN> instead of <LETTER>.
> 
> > However, I just noticed that [U+1100-U+11FF] is included both in the
> > <LETTER> and <KOREAN> sections - not good.  I think [U+1100-U+11FF]
> > should be removed from the <LETTER> definition, and left as-is in the
> > <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
> > to <KOREAN>.

Since [:letter:] includes all of the Korean ranges, there's no reason (AFAICT) to treat them
separately; unlike Chinese and Japanese characters, which are individually tokenized, the
Korean characters should participate in the same token boundary rules as all of the other
letters.
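
As a spot check of that claim, the sketch below asks the JRE about one code point from each Hangul range discussed in this thread, plus a Hangul syllable. The class name and the choice of sample code points are mine; it assumes a Java 5 or later runtime for the int overloads, and it samples single code points rather than whole ranges, since U+FFA0-U+FFDC contains unassigned gaps.

```java
// Sketch only: spot-checks that sample Hangul code points are letters per the JRE.
final class HangulLetterCheck {

    // One sample from each range; chosen for illustration.
    static final int JAMO = 0x1100;            // in U+1100-U+11FF (Hangul Jamo)
    static final int HALFWIDTH_HANGUL = 0xFFA1; // in U+FFA0-U+FFDC
    static final int SYLLABLE = 0xAC00;         // first Hangul syllable

    /** Returns true if every given code point satisfies Character.isLetter. */
    static boolean allLetters(int... codePoints) {
        for (int cp : codePoints) {
            if (!Character.isLetter(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (int cp : new int[] { JAMO, HALFWIDTH_HANGUL, SYLLABLE }) {
            System.out.printf("U+%04X isLetter=%b block=%s%n",
                    cp, Character.isLetter(cp), Character.UnicodeBlock.of(cp));
        }
    }
}
```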

> I had a bit more of a look through the Unicode blocks and
> found some more ranges which may or may not be worth considering.

I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 supports, and
Unicode 5.0, the latest version, and there are lots of new and modified letter and digit ranges.
 This stuff gets tweaked all the time, and I don't think Lucene should be in the business
of trying to track it, or take a position on which Unicode version users' data should conform
to.  

Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties (most
of) these decisions to the user's choice of JVM version, and this seems much more reasonable
to me than the status quo.

I will create a JIRA issue and attach a patch.

Steve
