lucene-dev mailing list archives

From karl wettin
Subject Re: Character encoding per index.
Date Mon, 12 Dec 2005 18:04:40 GMT

On 12 Dec 2005, at 16:40, karl wettin wrote:

> Hello list,
> I'm looking for a way to change the character encoding per index. It
> feels silly to store Chinese characters in 3 bytes using UTF-8 when
> it is possible to do it in 2 bytes using UTF-16. With a quick and
> dirty hack of IndexInput and IndexOutput I got it all running in
> UTF-16, but that is not good enough, since I have other indexes
> that are better optimized when encoded in UTF-8.
> The character encoding in Lucene today is quite static. To select
> an encoding, it seems to me I would have to do some major refactoring
> of the project, passing a character codec from my analyzer (or
> perhaps IndexWriter/Reader) all the way down to IndexInput/
> Output via TermVector/Info, etc.
> Can someone think of a better way to set the character encoding per
> index? Or does anyone have other thoughts?
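
To make the size argument above concrete, here is a minimal sketch in plain Java (no Lucene involved). The class name EncodingSize and its helpers are made up for illustration. It uses UTF-16BE so that no byte order mark is counted, and it measures standard UTF-8, not whatever variant an index format may actually write.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: compares encoded sizes of a string.
final class EncodingSize {

    // Number of bytes the string occupies in standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16, big-endian, without a byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters
        String latin = "lucene";         // six ASCII characters

        // CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
        System.out.println("CJK   UTF-8: " + utf8Bytes(chinese) + ", UTF-16: " + utf16Bytes(chinese));
        // ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
        System.out.println("ASCII UTF-8: " + utf8Bytes(latin) + ", UTF-16: " + utf16Bytes(latin));
    }
}
```

The same numbers cut both ways: mostly-ASCII text doubles in size under UTF-16, which is why a single global encoding does not suit every index.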

My current thought is to extend Directory
(CharacterEncodingAwareDirectory or similar), along with all of its
implementations, so that it intercepts the create/openFile methods and
attaches a character encoding strategy to the IndexInput/Output.
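
A rough sketch of that idea, deliberately written against plain Java rather than the real Lucene classes: EncodingAwareStore, Output, Input, createFile, openFile and fileLength are all hypothetical names, and the in-memory map merely stands in for a Directory. The point is only that the store picks the charset once and every stream it hands out inherits it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding-aware directory (not Lucene API).
final class EncodingAwareStore {
    private final Charset charset; // the per-index encoding strategy
    private final Map<String, ByteArrayOutputStream> files = new HashMap<>();

    EncodingAwareStore(Charset charset) {
        this.charset = charset;
    }

    // Output handle that encodes strings with the store's charset.
    final class Output {
        private final ByteArrayOutputStream sink;

        Output(ByteArrayOutputStream sink) {
            this.sink = sink;
        }

        // Writes a 4-byte length prefix followed by the encoded bytes.
        void writeString(String s) {
            byte[] encoded = s.getBytes(charset);
            byte[] length = ByteBuffer.allocate(4).putInt(encoded.length).array();
            sink.write(length, 0, length.length);
            sink.write(encoded, 0, encoded.length);
        }
    }

    // Input handle that decodes strings with the same charset.
    final class Input {
        private final ByteBuffer in;

        Input(byte[] data) {
            this.in = ByteBuffer.wrap(data);
        }

        // Reads the length prefix, then decodes that many bytes.
        String readString() {
            byte[] encoded = new byte[in.getInt()];
            in.get(encoded);
            return new String(encoded, charset);
        }
    }

    // Intercept point: every created file gets the store's charset.
    Output createFile(String name) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        files.put(name, sink);
        return new Output(sink);
    }

    // Intercept point: every opened file is decoded with the same charset.
    Input openFile(String name) {
        return new Input(files.get(name).toByteArray());
    }

    // Size in bytes of what has been written to the named file so far.
    long fileLength(String name) {
        return files.get(name).size();
    }
}
```

With this shape, one store could be built with StandardCharsets.UTF_16BE and another with StandardCharsets.UTF_8, and nothing above the store needs to know which one it is talking to. Whether the real Directory implementations can be wrapped this cleanly is exactly the open question.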

Is there a reason why write/readCharacters in IndexInput/Output are
final?

