lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Character encoding per index.
Date Mon, 12 Dec 2005 18:14:21 GMT
On Dec 12, 2005, at 10:04 AM, karl wettin wrote:

> 12 Dec 2005, at 16:40, karl wettin wrote:
>> Hello list,
>> I'm looking for a way to change the character encoding per index. It
>> feels silly to store Chinese characters in 3 bytes using UTF-8
>> when it is possible to do it with 2 bytes using UTF-16. By just
>> hacking IndexInput and IndexOutput I got it all running in UTF-16
>> in a quick-and-dirty way, but this is not good enough, since I have
>> other indexes that are better optimized when encoded in UTF-8.
>> The character encoding in Lucene today is quite static. In order
>> to select an encoding, it seems to me I have to do some major
>> refactoring of the project, passing a character codec from my
>> analyzer (or perhaps IndexWriter/Reader) all the way down to
>> IndexInput/Output via TermVector/Info, etc.

On a side note, this is another issue that I believe can be addressed  
by using a bytecount instead of a charcount at the head of Lucene's  

A byte-based TermBuffer needn't care what encoding the Strings are in.
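
To make the idea concrete, here is a minimal sketch in plain Java. It is not Lucene's actual file format or API: the class ByteCountPrefix, its two methods, and the fixed 4-byte length header are all made up for illustration. The point is only that once the header counts bytes, the reader can pull out the payload without knowing which encoding produced it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Lucene.
class ByteCountPrefix {

    // Assumed record layout: a 4-byte big-endian byte count, then the encoded bytes.
    static byte[] writeTerm(String term, Charset charset) {
        byte[] encoded = term.getBytes(charset);
        ByteBuffer buf = ByteBuffer.allocate(4 + encoded.length);
        buf.putInt(encoded.length); // header counts bytes, not chars
        buf.put(encoded);
        return buf.array();
    }

    // Reads the payload back without knowing which charset produced it.
    static byte[] readTermBytes(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int byteCount = buf.getInt();
        byte[] payload = new byte[byteCount];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        String term = "\u4e2d\u6587"; // two Chinese characters
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};
        for (Charset cs : charsets) {
            byte[] payload = readTermBytes(writeTerm(term, cs));
            // Only the caller that wants a String needs to know the charset.
            boolean same = term.equals(new String(payload, cs));
            System.out.println(cs + ": " + payload.length + " bytes, round trip ok: " + same);
        }
    }
}
```

With a charcount header the reader would have to decode as it goes just to find where the string ends. Here the same read path handles the 6-byte UTF-8 form and the 4-byte UTF-16 form of the same two characters.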

Marvin Humphrey
Rectangular Research
