lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Lucene and UTF-8
Date Wed, 21 Sep 2005 18:53:54 GMT
On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:

> import java.util.Arrays;
> ...
> Arrays.equals(array1, array2);

Great, thank you, Chris.

The patch for is done.  It will now write valid  
UTF-8.  Older versions of Lucene will not be able to read indexes  
written using this class, as they will choke if they encounter a null  
byte or a 4-byte UTF-8 sequence.

As an added bonus, this patch yields a speedup of a couple of
percentage points (on my machine), made possible by simplified
conditionals.  For instance, the first if() clause...

     if (code >= 0x01 && code <= 0x7F)

...is now...

     if (code < 0x80)
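
To make the change concrete, here is a minimal sketch of a standard UTF-8 encoder built around the simplified first test. It is not the patch itself: the class name Utf8EncodeSketch and the method encode() are made up for illustration, and it uses String.codePointAt(), which only exists in newer Java versions. Note that with (code < 0x80) a null goes out as the single byte 0x00, and code points above U+FFFF go out as 4-byte sequences.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is not the class from the patch.
final class Utf8EncodeSketch {

    /** Encodes a Java (UTF-16) string as standard UTF-8. */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int code = s.codePointAt(i);
            i += Character.charCount(code);
            // A lone surrogate comes back from codePointAt() unchanged;
            // substitute the replacement character U+FFFD for it.
            if (code >= 0xD800 && code <= 0xDFFF) {
                code = 0xFFFD;
            }
            if (code < 0x80) {
                // One byte, including U+0000 (no C0 80 special case).
                out.write(code);
            } else if (code < 0x800) {
                out.write(0xC0 | (code >> 6));
                out.write(0x80 | (code & 0x3F));
            } else if (code < 0x10000) {
                out.write(0xE0 | (code >> 12));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            } else {
                // Supplementary code point: a 4-byte sequence.
                out.write(0xF0 | (code >> 18));
                out.write(0x80 | ((code >> 12) & 0x3F));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 1-, 2-, 3- and 4-byte cases plus a null, checked against the JDK encoder.
        String sample = "a\u0000\u00E9\u20AC\uD83D\uDE00";
        byte[] expected = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encode(sample), expected));
    }
}
```

For well-formed input this should agree byte for byte with String.getBytes(StandardCharsets.UTF_8). The two differ only for unpaired surrogates, where the JDK encoder substitutes '?' rather than U+FFFD.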

The new class is sort of done.  It has all the  
tests Ken suggested, though I think it could stand the addition of a  
randomized test to excite edge cases.  The data mirrors the data from, and that's by design, as I think with so much  
overlap the two ought to be merged.  How does "" grab  
you all?
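
As a rough idea of what the randomized test could look like: the sketch below builds random well-formed strings, biased toward code points at the edges of the 1-, 2-, 3- and 4-byte UTF-8 ranges, and checks that they survive a round trip. The class and method names are hypothetical, and the round trip goes through the JDK's own UTF-8 codec as a stand-in; a real test would write and read through the classes under test instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical sketch of a randomized edge-case test; names are illustrative.
final class RandomUtf8Sketch {

    // Code points on either side of each UTF-8 length boundary,
    // plus the values bracketing the surrogate range.
    private static final int[] EDGES = {
        0x0, 0x7F, 0x80, 0x7FF, 0x800, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF
    };

    /** Builds a well-formed string containing exactly n code points. */
    static String randomString(Random rnd, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int cp = rnd.nextBoolean()
                    ? EDGES[rnd.nextInt(EDGES.length)]
                    : rnd.nextInt(0x110000);
            // Surrogate code points cannot stand alone; swap in U+FFFD.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    /** Stand-in round trip through the JDK codec (assumption: replace with the real writer/reader). */
    static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L); // fixed seed so failures are reproducible
        for (int i = 0; i < 1000; i++) {
            String s = randomString(rnd, 20);
            if (!roundTrips(s)) {
                throw new AssertionError("round trip failed on iteration " + i);
            }
        }
        System.out.println("all round trips passed");
    }
}
```

A fixed seed keeps any failure reproducible; logging the seed and varying it per run would widen coverage.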

On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:

> a. Single surrogate pair (two Java chars)
> b. Surrogate pair at the beginning, followed by regular data.
> c. Surrogate pair at the end, preceded by regular data.
> d. Two surrogate pairs in a row.
> Then all of the above, but remove the second (low-order) surrogate  
> character (busted format).
> Then all of the above, but replace the first (high-order) surrogate  
> character.

A minor wrinkle: each unpaired surrogate will have to be replaced by
the Unicode replacement character U+FFFD, or the VInt count will be
off.  This means that a UTF-16LE sequence containing a mis-ordered
surrogate pair will grow by a code point: correctly ordered, the pair
would represent a single code point, but mis-ordered, each of its two
surrogates gets subbed out for its own replacement character.  I
don't think this is serious, though.
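
The growth described above can be seen in a small sketch. The helper replaceUnpairedSurrogates() is a hypothetical name, not something from the patch, and it relies on Character methods from newer Java versions. A correctly ordered pair counts as one code point; the same two chars in the wrong order come out as two U+FFFD characters.

```java
// Hypothetical helper for illustration; not taken from the patch.
final class SurrogateRepairSketch {

    /** Returns a copy of s in which every unpaired surrogate is replaced by U+FFFD. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both halves unchanged.
                sb.append(c).append(s.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                // Lone or mis-ordered surrogate: one replacement character each.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ordered = "\uD835\uDD0A";    // high surrogate, then low surrogate
        String misordered = "\uDD0A\uD835"; // same chars, wrong order
        System.out.println(codePoints(replaceUnpairedSurrogates(ordered)));
        System.out.println(codePoints(replaceUnpairedSurrogates(misordered)));
    }
}
```

Because every surrogate that fails to pair is replaced one for one, the char count of the string is unchanged while the code point count can go up.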

> Then all of the above, but replace the surrogate pair with an xC0  
> x80 encoded null byte.

I left this out of the test cases for IndexOutput (it's in there, and  
important, for IndexInput).  The UTF-16 sequence "\u00C0\u0080"  
doesn't map to a null, so I used the regular UTF-16 null "\u0000".   
As before, I think this is what you intended.
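
To show why the C0 80 case belongs on the reading side only, here is a small sketch. It uses Java's own modified UTF-8 (DataOutputStream.writeUTF) as a stand-in for the older encoding; whether the old Lucene format matches it byte for byte is an assumption, and the class name is made up. The two-byte null exists only at the byte level: the Java string "\u00C0\u0080" is two ordinary characters that encode to four bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; writeUTF() stands in for the older encoding.
final class NullEncodingSketch {

    /** Bytes produced by Java's modified UTF-8, without the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes produced by the standard UTF-8 encoder. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(modifiedUtf8("\u0000")));       // null in modified UTF-8
        System.out.println(hex(standardUtf8("\u0000")));       // null in standard UTF-8
        System.out.println(hex(standardUtf8("\u00C0\u0080"))); // two ordinary chars
    }
}
```

So a writer emitting standard UTF-8 only ever needs the plain "\u0000" test case, while a reader that must stay compatible with old data still has to accept the C0 80 form.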

Files and patches can be found here:

Marvin Humphrey
Rectangular Research
