lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: bytecount as String and prefix length
Date Tue, 01 Nov 2005 06:36:12 GMT

On Oct 31, 2005, at 5:15 PM, Robert Engels wrote:

> All of the JDK source is available via download from Sun.

Thanks.  I believe the UTF-8 coding algos can be found in...

j2se > src > share > classes > sun > nio > cs > UTF_8.java

It looks like the translator methods have fairly high per-iteration
overhead, since they have to keep track of the member variables of
ByteBuffer and CharBuffer objects and prepare to return result objects
on each loop iteration.  Also, they have robust error-checking for
malformed source data, which Lucene traditionally has not.  The algo
below my sig should be faster.
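
For comparison, here is a minimal sketch of what decoding through the standard java.nio.charset API looks like. It only illustrates the ByteBuffer/CharBuffer plumbing and the malformed-input checking described above; it is not the Sun implementation itself. The class name JdkUtf8Decode is hypothetical, and StandardCharsets assumes Java 7 or later (older JDKs would use Charset.forName("UTF-8")).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class JdkUtf8Decode {

    // Decodes a slice of UTF-8 bytes through the general-purpose decoder.
    // The input is wrapped in a ByteBuffer and the output comes back as a
    // CharBuffer; malformed input is reported as an exception.
    static String decode(byte[] bytes, int start, int length)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = decoder.decode(ByteBuffer.wrap(bytes, start, length));
        return out.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = "na\u00EFve".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, 0, utf8.length));
    }
}
```

The convenience of this route is that truncated or invalid sequences are rejected rather than silently decoded, which is exactly the checking a hand-rolled loop gives up in exchange for speed.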

I wrote...

> So my next step is to write a utf8ToString method that's as efficient
> as I can make it.

Ok, this time we made a little headway.  We're down from 20% slower
to around 10% slower indexing than the current implementation.  But I
don't see how I'm going to get it any faster.  There's maybe one
conditional in FieldsReader that can be simplified.

There's another downside to the way I'm implementing this right now.
The byteBuf and charBuf have to be kept somewhere.  Currently, I'm
allocating a ByteBuffer for each TermInfosWriter and a CharBuffer for
each TermBuffer.  That's something of a memory hit, though it's hard
to say exactly how much.  IndexInput and IndexOutput are still using
the Sun methods -- when I gave them Buffers, they slowed down.

I've got one more idea... time to try overriding readString and  
writeString in BufferedIndexInput and BufferedIndexOutput, to take  
advantage of buffers that are already there.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

//----------------------------------------------------------------

   public static final CharBuffer utf8ToChars (
         byte[] bytes, int start, int length, CharBuffer charBuf) {
     int i = start;
     int j = 0;
     final int end = start + length;
     char[] chars = charBuf.array();
     try {
       while (i < end) {
         byte b = bytes[i++];
         switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
           case 0:
             chars[j++] = (char)(b & 0x7F);
             break;
           case 1:
             chars[j++] = (char)(((b & 0x1F) << 6)
               | (bytes[i++] & 0x3F));
             break;
           case 2:
             chars[j++] = (char)(((b & 0x0F) << 12)
               | ((bytes[i++] & 0x3F) << 6)
               |  (bytes[i++] & 0x3F));
             break;
           case 3:
             // Four-byte sequence: decode the code point, then emit it
             // as a UTF-16 surrogate pair.
             int utf32 = (((b & 0x07) << 18)
               | ((bytes[i++] & 0x3F) << 12)
               | ((bytes[i++] & 0x3F) << 6)
               |  (bytes[i++] & 0x3F));
             chars[j++] = (char)((utf32 >> 10) + 0xD7C0);
             chars[j++] = (char)((utf32 & 0x03FF) + 0xDC00);
             break;
         }
       }
     }
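     catch (ArrayIndexOutOfBoundsException e) {
       // charBuf was too small: estimate the capacity needed from the
       // chars-per-byte ratio seen so far (plus 10% slack), then retry
       // from the beginning with a larger buffer.
       float bytesProcessed = (float)(i - start);
       float charsPerByte = (j / bytesProcessed) * 1.1f;

       float bytesLeft = length - bytesProcessed;
       float targetSize = (float)chars.length
         + charsPerByte * bytesLeft + 1.0f;
       return utf8ToChars(bytes, start, length,
         CharBuffer.allocate((int)targetSize));
     }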
     charBuf.position(j);
     return charBuf;
   }
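
The method above relies on a TRAILING_BYTES_FOR_UTF8 lookup table that is not shown in this message. The following is a minimal, self-contained sketch of how such a table could be built and used, under stated assumptions: the table contents, the class name Utf8Sketch, and the method utf8ToString are all hypothetical. It also swaps the retry-on-exception buffer growth for a simpler up-front allocation, relying on the fact that a UTF-8 sequence never decodes to more UTF-16 chars than it has bytes. Like the original, it does no validation of malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the code from the actual patch.
final class Utf8Sketch {

    // Assumed table layout: indexed by the unsigned lead byte, giving the
    // number of continuation bytes that follow it.
    static final byte[] TRAILING_BYTES_FOR_UTF8 = new byte[256];
    static {
        for (int b = 0xC0; b < 0xE0; b++) TRAILING_BYTES_FOR_UTF8[b] = 1;
        for (int b = 0xE0; b < 0xF0; b++) TRAILING_BYTES_FOR_UTF8[b] = 2;
        for (int b = 0xF0; b < 0xF8; b++) TRAILING_BYTES_FOR_UTF8[b] = 3;
        // All other entries stay 0, so invalid lead bytes are treated as
        // single bytes rather than rejected.
    }

    static String utf8ToString(byte[] bytes, int start, int length) {
        // length chars is always enough: 1-3 byte sequences yield one
        // char, 4-byte sequences yield two.
        char[] chars = new char[length];
        int i = start;
        int j = 0;
        final int end = start + length;
        while (i < end) {
            byte b = bytes[i++];
            switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
                case 0:
                    chars[j++] = (char) (b & 0x7F);
                    break;
                case 1:
                    chars[j++] = (char) (((b & 0x1F) << 6)
                        | (bytes[i++] & 0x3F));
                    break;
                case 2:
                    chars[j++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F));
                    break;
                case 3: {
                    int cp = ((b & 0x07) << 18)
                        | ((bytes[i++] & 0x3F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F);
                    // Split into a UTF-16 surrogate pair.
                    chars[j++] = (char) ((cp >> 10) + 0xD7C0);
                    chars[j++] = (char) ((cp & 0x03FF) + 0xDC00);
                    break;
                }
            }
        }
        return new String(chars, 0, j);
    }

    public static void main(String[] args) {
        // One sequence of each length: 'a', U+00E9, U+20AC, U+1F600.
        String original = "a\u00E9\u20AC\uD83D\uDE00";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8ToString(utf8, 0, utf8.length).equals(original));
    }
}
```

Allocating length chars up front can waste space on text that is mostly multi-byte, which is presumably why the version above reuses a caller-supplied CharBuffer and only grows it when needed.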




