lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] Commented: (LUCENE-1689) supplementary character handling
Date Sat, 13 Jun 2009 14:23:07 GMT


Simon Willnauer commented on LUCENE-1689:

The scary thing is that this happens already if you run Lucene on a 1.5 VM, even without introducing
1.5 code.
I think we need to act on this issue asap and release it together with 3.0 -> full support
for Unicode 4.0 in Lucene 3.0.
I also thought about the implementation a little bit. The need to detect chars > BMP and
operate on them might be spread out across Lucene (quite a few analyzers, filters,
etc.). Performance could truly suffer from this if it is done "wrong" or even more than once.
It might be worth making the detection pluggable, with an initial filter that only
checks whether surrogates are present in a token and sets an indicator on the token representation,
so that subsequent TokenStreams can operate on it without rechecking. This would also preserve
performance for those who do not need chars > BMP (which could be quite a large number of
people); those users could then simply not supply such an initial filter.
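The "scan once, flag the token" idea above can be sketched in plain Java. This is only an illustration, not Lucene code; the class and method names (SurrogateScan, hasSupplementary) are made up for the example:

```java
// Hypothetical sketch of the pluggable detection step: one pass over the
// term's char buffer determines whether any surrogate code units are present,
// so downstream filters can take the fast BMP-only path when the flag is false.
// Not Lucene API; names here are illustrative only.
public class SurrogateScan {

    /** Returns true if the first {@code length} chars contain a surrogate,
     *  i.e. the token holds at least one character outside the BMP. */
    public static boolean hasSupplementary(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            char ch = buffer[i];
            // In well-formed UTF-16 text, surrogate code units only occur as
            // halves of supplementary-character pairs.
            if (Character.isHighSurrogate(ch) || Character.isLowSurrogate(ch)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char[] bmpOnly = "plain ascii token".toCharArray();
        // U+1D49E (MATHEMATICAL SCRIPT CAPITAL C) is encoded as a surrogate pair.
        char[] withSuppl = "math \uD835\uDC9E".toCharArray();
        System.out.println(hasSupplementary(bmpOnly, bmpOnly.length));     // false
        System.out.println(hasSupplementary(withSuppl, withSuppl.length)); // true
    }
}
```

The per-token cost is a single linear scan, which is exactly the work BMP-only users would avoid by leaving the filter out of their chain.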

Just a couple of random thoughts.

> supplementary character handling
> --------------------------------
>                 Key: LUCENE-1689
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1689_lowercase_example.txt
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they
> don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize()
> use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters
> should be the exception, but still work.
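The LowercaseFilter point quoted above comes down to using the int-based Character.toLowerCase(int) instead of the char overload, which cannot see a supplementary character split across a surrogate pair. A minimal standalone sketch (plain Java, not the actual filter patch):

```java
// Illustrates why char-by-char lowercasing loses supplementary characters:
// iterating by code point and using Character.toLowerCase(int) handles
// characters > BMP, while the char-based overload never can.
public class CodePointLowercase {

    /** Lowercases one code point at a time, keeping surrogate pairs intact. */
    public static String lowerCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);            // reads the full pair if present
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);         // 2 for supplementary, 1 for BMP
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I lowercases to U+10428,
        // a case mapping only reachable through the int-based API.
        String deseret = new String(Character.toChars(0x10400));
        String lowered = lowerCase(deseret);
        System.out.println(Integer.toHexString(lowered.codePointAt(0))); // 10428
        System.out.println(lowerCase("ABC"));                            // abc
    }
}
```

The BMP fast path is preserved automatically here: for BMP input, codePointAt never crosses a pair and charCount is always 1.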

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

