lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1689) supplementary character handling
Date Sat, 13 Jun 2009 18:12:07 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1689:
---------------------------------

    Fix Version/s: 3.1
                   (was: 2.9)

bq. I am curious how you plan on approaching backwards compat? 
I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP
chars wouldn't be lowercased.  To rely on that would also be relying on undocumented behavior.
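
To illustrate the behavior in question, here is a minimal sketch (not the patch attached to this issue) comparing char-by-char lowercasing with code-point lowercasing. The class and method names are made up for the example; U+10400 (DESERET CAPITAL LETTER LONG I) is used only as a convenient supplementary character that has a lowercase mapping.

```java
class SupplementaryLowerCaseSketch {

    // Lowercases one UTF-16 code unit (char) at a time. Surrogate halves have
    // no case mapping of their own, so a supplementary character is left as-is.
    static String lowerByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Lowercases one code point at a time, using the int-based API added in Java 5.
    static String lowerByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);                 // reads a surrogate pair as one code point
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);              // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // supplementary capital letter
        String lower = new String(Character.toChars(0x10428)); // its lowercase form

        // The char-based loop returns the input unchanged.
        System.out.println(lowerByChar(upper).equals(upper));
        // The code-point loop produces the lowercase letter.
        System.out.println(lowerByCodePoint(upper).equals(lower));
    }
}
```

The code-point version depends on Character.toLowerCase(int), String.codePointAt and Character.charCount, all of which first appear in Java 1.5.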

It seems like this (handling code points beyond the BMP) is really a collection of issues
that could be committed independently when ready?

Moving to 3.1 since much of it requires Java 1.5.
If there's something desirable to slip in for 2.9 that doesn't depend on Java 1.5, we can still
do that.



> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they
> don't actually remove supplementary characters, or modified to look for surrogates and
> behave correctly.
> LowercaseFilter should be modified to lowercase supplementary characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize()
> use int.
> In all of these cases code should remain optimized for the BMP case, and supplementary
> characters should be the exception, but still work.
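
As a rough illustration of the CharTokenizer point in the quoted description, the sketch below shows one way a tokenizer loop could take int code points while keeping a fast path for BMP characters. It is not Lucene's CharTokenizer and not the proposed patch: the class names, the tokenize(String) method and the List<String> return type are assumptions made for this example. Only the method names isTokenChar() and normalize() come from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharTokenizer whose hooks take int code points.
abstract class CodePointTokenizerSketch {

    // True if the code point belongs inside a token.
    protected abstract boolean isTokenChar(int codePoint);

    // Hook for per-code-point normalization; identity by default.
    protected int normalize(int codePoint) {
        return codePoint;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        final int len = text.length();
        int i = 0;
        while (i < len) {
            char c = text.charAt(i);
            int cp;
            if (!Character.isHighSurrogate(c)) {
                // Fast path: a single char is the whole code point (the BMP case).
                cp = c;
                i++;
            } else {
                // Slow path: combine the surrogate pair. An unpaired high
                // surrogate is returned as-is and advances by one char.
                cp = text.codePointAt(i);
                i += Character.charCount(cp);
            }
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Example subclass: letters form tokens and are lowercased per code point.
class LowerCaseLetterTokenizerSketch extends CodePointTokenizerSketch {
    @Override
    protected boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    @Override
    protected int normalize(int codePoint) {
        return Character.toLowerCase(codePoint);
    }
}
```

With this shape, a supplementary letter stays inside its token instead of being dropped as two non-letter surrogate chars, and BMP-only text never enters the surrogate branch.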

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

