lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Lucene and UTF-8
Date Wed, 21 Sep 2005 20:45:33 GMT

On Sep 21, 2005, at 12:25 PM, Yonik Seeley wrote:

> How does this patch work w.r.t. the length vint?
> It looks like the length is still the number of 16 bit java chars,
> but the encoding is now correct UTF-8?

Yes.  As Ken Krugler pointed out to me, the issues can be separated.   
The length VInt can be changed now or in the future.

There may be lots of reasons to change the length VInt to use bytes;  
IIRC, you were one of the people inclined in that direction.   
(Another possibility was to count Unicode characters, but there doesn't  
seem to be any advantage in going that route besides aesthetic  
harmony.)  The decision to change it or not to change it will have to  
be taken after a festive round of benchmarking.
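
To make the three candidate units for the length prefix concrete, here is a minimal Java sketch. The class and method names (`LengthPrefixSketch`, `javaCharLength`, and so on) are hypothetical and are not part of Lucene or of the patch under discussion. `encodeVInt` follows the usual 7-bits-per-byte variable-length integer layout; treat it as an illustration of the idea rather than Lucene's actual writer code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from Lucene.
final class LengthPrefixSketch {

    // Variable-length int: 7 data bits per byte, low-order group first.
    // The high bit of a byte is set when more bytes follow.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Candidate 1: number of 16-bit Java chars (UTF-16 code units).
    static int javaCharLength(String s) {
        return s.length();
    }

    // Candidate 2: number of bytes in the standard UTF-8 encoding.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Candidate 3: number of Unicode code points.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis, a space, then U+1D11E (outside the BMP).
        String term = "na\u00EFve \uD834\uDD1E";
        System.out.println("java chars:  " + javaCharLength(term));
        System.out.println("utf-8 bytes: " + utf8ByteLength(term));
        System.out.println("code points: " + codePointLength(term));
        System.out.println("VInt bytes for the byte count: "
                + encodeVInt(utf8ByteLength(term)).length);
    }
}
```

For this sample string the three counts differ: 8 Java chars, 11 UTF-8 bytes, 7 code points. A reader that knows the byte count can skip or copy a string without decoding it, which is the practical argument for bytes; whether that pays off in Java is what the benchmarking would have to show.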

If nobody steps up to do that benchmarking, I'll probably try to  
kickstart the discussion with a little of my own, as it would be much  
better for the Perl side to use bytes as the length VInt, no  
question.  But since I'm basically an army of one working the Perl  
angle right now, it would be great if I didn't have to stretch myself  
even thinner doing benchmarking in Java when there are a lot more  
people with a lot more expertise who can take that on.

Perl development is going very well, by the way.  On the indexing  
side, I've got a new app going which solves both the index  
compatibility issue and the speed issue, about which I'll make a  
presentation in this forum after I flesh it out and clean it up.

Well, I'm lying a little.  The app doesn't quite write a valid Lucene  
1.4.3 index, since it writes true UTF-8.  If these patches get  
adopted prior to the release of 1.9, though, it will write valid  
Lucene 1.9 indexes.
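
The gap between "true UTF-8" and what an older reader expects can be sketched as follows. This assumes that the 1.4.3 format uses a Java-style "modified UTF-8", and it uses `DataOutputStream.writeUTF` from the JDK as a stand-in for that family of encodings; it is not Lucene's own writer, and the class name `Utf8VariantSketch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; a stand-in, not Lucene's writer.
final class Utf8VariantSketch {

    // Standard ("true") UTF-8 as produced by the JDK charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8 via DataOutputStream.writeUTF, with the
    // 2-byte length prefix that writeUTF prepends stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withPrefix = buf.toByteArray();
        return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, one code point, two Java chars
        String nul = "\u0000";
        System.out.println("U+1D11E standard: " + standardUtf8(clef).length + " bytes");
        System.out.println("U+1D11E modified: " + modifiedUtf8(clef).length + " bytes");
        System.out.println("U+0000  standard: " + standardUtf8(nul).length + " bytes");
        System.out.println("U+0000  modified: " + modifiedUtf8(nul).length + " bytes");
    }
}
```

The two encodings agree on everything in the Basic Multilingual Plane except U+0000, and disagree on supplementary characters: standard UTF-8 writes one 4-byte sequence, while the modified form writes each surrogate as its own 3-byte sequence. A reader built for one form can therefore misread an index written in the other.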

Marvin Humphrey
Rectangular Research
