lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: bytecount as String and prefix length
Date Tue, 01 Nov 2005 00:31:14 GMT
I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck.  Still 20% slower than the current implementation.   
The algo is below, for reference.

It's entirely possible that my patches are doing something dumb  
that's causing this, given my limited experience with Java.  But if  
that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer  
before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the  
length VInt yields a significant benefit over the standard techniques  
for reading in UTF-8.  That wouldn't be hard to believe.  Without  
that number, there's a lot of guesswork involved.  English requires  
about 1.1 bytes per UTF-8 code point; Japanese, 3.  Multiple memory  
allocation ops may be required as bytes get read in, especially if  
the final String object kicked out HAS to use the bare minimum amount  
of memory.  I don't suppose there's any way for me to snoop just  
what's happening under the hood in these CharsetDecoder classes or  
String constructors, is there?
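To make the pre-allocation point concrete, here is a minimal sketch of a decoder that is handed the char count up front, so it can allocate the char[] once at exactly the right size. The class and method names are hypothetical and are not Lucene's API. The sketch assumes well-formed standard UTF-8 and a count given in UTF-16 code units, and it does no validation.

```java
import java.nio.charset.StandardCharsets;

final class PreallocatedDecode {
    /**
     * Decodes exactly charCount UTF-16 code units from bytes, starting at
     * offset. Hypothetical helper: assumes well-formed UTF-8 and an exact count.
     */
    static String decode(byte[] bytes, int offset, int charCount) {
        char[] chars = new char[charCount];  // one allocation, no resizing
        int in = offset;
        int out = 0;
        while (out < charCount) {
            int b = bytes[in++] & 0xFF;
            if (b < 0x80) {
                // one-byte sequence (ASCII)
                chars[out++] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {
                // two-byte sequence
                chars[out++] = (char) (((b & 0x1F) << 6) | (bytes[in++] & 0x3F));
            } else if ((b & 0xF0) == 0xE0) {
                // three-byte sequence
                chars[out++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[in++] & 0x3F) << 6)
                        | (bytes[in++] & 0x3F));
            } else {
                // four-byte sequence: becomes a surrogate pair (two chars)
                int cp = ((b & 0x07) << 18)
                        | ((bytes[in++] & 0x3F) << 12)
                        | ((bytes[in++] & 0x3F) << 6)
                        | (bytes[in++] & 0x3F);
                chars[out++] = Character.highSurrogate(cp);
                chars[out++] = Character.lowSurrogate(cp);
            }
        }
        // Note: this String constructor still copies the array once.
        return new String(chars);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9 \u65e5\u672c\u8a9e";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, 0, s.length()).equals(s));
    }
}
```

Without the count, a decoder has to guess an initial size and grow or trim the array afterwards, which is the guesswork described above.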

Scanning through a SegmentTermEnum with next() doesn't seem to be any  
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs  
benchmarker doesn't slow down that much when IndexInput is changed to  
use a String constructor that accepts UTF-8 bytes rather than chars.   
However, it's possible that the modified toTerm method of TermBuffer  
is a bottleneck, as it also uses the UTF-8 String constructor.  It  
doesn't get exercised under, but during  
merging of segments I believe it sees plenty of action -- maybe a lot  
more than IndexInput's readString.
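For reference, a String constructor that accepts UTF-8 bytes looks roughly like the sketch below. The post does not say which overload the patch uses, so the choice here is an assumption: this is the Charset overload from later JDKs, and the overload taking the charset name "UTF-8" behaves the same way. The wrapper name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8StringConstructor {
    /** Hypothetical wrapper: builds a String from byteLength UTF-8 bytes. */
    static String fromUtf8(byte[] buf, int offset, int byteLength) {
        // The constructor is given a byte length, not a char count, so it
        // has to work out the number of chars while decoding.
        return new String(buf, offset, byteLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buf = "abc\u65e5\u672c".getBytes(StandardCharsets.UTF_8);
        String tail = fromUtf8(buf, 3, buf.length - 3);
        System.out.println(tail.length() + " chars from " + (buf.length - 3) + " bytes");
    }
}
```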

So my next step is to write a utf8ToString method that's as efficient  
as I can make it.  After that... I dunno, I'm running out of ideas.

Marvin Humphrey
Rectangular Research

   // requires: import java.nio.ByteBuffer;
   public static final ByteBuffer stringToUTF8(
         String s, int start, int length, ByteBuffer byteBuf) {
     int i = start;
     int j = 0;
     try {
       final int end = start + length;
       byte[] bytes = byteBuf.array();
       for ( ; i < end; i++) {
         final int code = (int)s.charAt(i);
         if (code < 0x80)
           bytes[j++] = (byte)code;
         else if (code < 0x800) {
           bytes[j++] = (byte)(0xC0 | (code >> 6));
           bytes[j++] = (byte)(0x80 | (code & 0x3F));
         } else if (code < 0xD800 || code > 0xDFFF) {
           bytes[j++] = (byte)(0xE0 | (code >>> 12));
           bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
           bytes[j++] = (byte)(0x80 | (code & 0x3F));
         } else {
           // surrogate pair
           int utf32;
           // confirm valid high surrogate
           if (code < 0xDC00 && (i < end-1)) {
             utf32 = ((int)s.charAt(i+1));
             // confirm valid low surrogate and write pair
             if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
               utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
               bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
               bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
               bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
               bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
               i++;  // skip the low surrogate that was just consumed
               continue;
             }
           }
           // replace unpaired surrogate or out-of-order low surrogate
           // with substitution character
           bytes[j++] = (byte)0xEF;
           bytes[j++] = (byte)0xBF;
           bytes[j++] = (byte)0xBD;
         }
       }
     }
     catch (ArrayIndexOutOfBoundsException e) {
       // guess how many more bytes it will take, plus 10%
       // (clamp to 1 so an overflow on the first char can't divide by zero)
       float charsProcessed = (float)Math.max(i - start, 1);
       float bytesPerChar = (j / charsProcessed) * 1.1f;

       float charsLeft = length - charsProcessed;
       float targetSize
         = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;

       // start over from the beginning with a bigger buffer
       return stringToUTF8(s, start, length,
           ByteBuffer.allocate((int)targetSize));
     }
     return byteBuf;
   }
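
The resize guess in the catch block can be pulled out as plain arithmetic: the new capacity is the old capacity, plus the observed bytes per char padded by 10% and multiplied by the chars still to encode, plus one. The helper below is a hypothetical sketch of that formula, not part of the posted method. The clamp to at least one processed char is a defensive assumption of the sketch, so that an overflow on the very first char cannot produce a division by zero.

```java
final class ResizeEstimate {
    /**
     * Hypothetical helper mirroring the catch block's estimate:
     * capacity + (bytesWritten / charsProcessed * 1.1) * charsLeft + 1.
     */
    static int estimate(int capacity, int charsProcessed,
                        int bytesWritten, int totalChars) {
        // Assumption: treat "no chars processed yet" as one char.
        float processed = (float) Math.max(charsProcessed, 1);
        float bytesPerChar = (bytesWritten / processed) * 1.1f;
        float charsLeft = totalChars - processed;
        float target = (float) capacity + bytesPerChar * charsLeft + 1.0f;
        return (int) target;
    }

    public static void main(String[] args) {
        // 100-byte buffer filled after 50 of 200 chars
        System.out.println(estimate(100, 50, 100, 200));
    }
}
```

Because of the trailing + 1, the estimate always exceeds the old capacity as long as at least one char has been processed, so each retry makes progress. Text that starts as ASCII and then switches to three-byte characters will still underestimate and retry more than once.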
