lucene-dev mailing list archives

From karl wettin
Subject Character encoding per index.
Date Mon, 12 Dec 2005 15:40:40 GMT
Hello list,

I'm looking for a way to change the character encoding per index. It  
feels silly to store Chinese characters in 3 bytes each using UTF-8  
when 2 bytes each would do using UTF-16. With a quick and dirty hack  
of IndexInput and IndexOutput I got everything running in UTF-16, but  
that is not good enough, since I have other indexes that are more  
compact when encoded in UTF-8.
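
To make the size difference concrete, here is a small sketch that counts the bytes each encoding needs for a Chinese term and for an English one. It uses only the standard Java charsets, not Lucene's own on-disk string format, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    // Number of bytes the string occupies in standard UTF-8.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters
        String english = "index";        // five ASCII characters

        // CJK text: 6 bytes in UTF-8, 4 bytes in UTF-16.
        System.out.println(utf8Bytes(chinese) + " vs " + utf16Bytes(chinese));
        // ASCII text: 5 bytes in UTF-8, 10 bytes in UTF-16.
        System.out.println(utf8Bytes(english) + " vs " + utf16Bytes(english));
    }
}
```

Neither encoding wins for every index, which is why a per-index choice is attractive.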

The character encoding in Lucene today is hard-coded. To make it  
selectable, it seems to me I would have to do some major refactoring  
of the project, passing a character codec from my analyzer (or perhaps  
from IndexWriter/IndexReader) all the way down to IndexInput/IndexOutput  
via TermVector/TermInfo, etc.
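
A rough sketch of that idea, assuming a pluggable codec object that the output side receives from above. Everything here is hypothetical: `CharCodec` and `CodecOutput` are invented names, not Lucene classes, and the 4-byte length prefix is an arbitrary choice for the example rather than Lucene's file format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PerIndexCodecSketch {

    /** Hypothetical codec interface; not part of Lucene. */
    public interface CharCodec {
        byte[] encode(String s);
        String decode(byte[] bytes);
    }

    /** Builds a codec backed by a standard Java charset. */
    public static CharCodec forCharset(final Charset charset) {
        return new CharCodec() {
            public byte[] encode(String s) { return s.getBytes(charset); }
            public String decode(byte[] bytes) { return new String(bytes, charset); }
        };
    }

    /** Hypothetical stand-in for an IndexOutput that is handed its codec. */
    public static final class CodecOutput {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private final CharCodec codec;

        public CodecOutput(CharCodec codec) { this.codec = codec; }

        /** Writes a 4-byte big-endian length followed by the encoded bytes. */
        public void writeString(String s) {
            byte[] b = codec.encode(s);
            out.write(b.length >>> 24);
            out.write(b.length >>> 16);
            out.write(b.length >>> 8);
            out.write(b.length);
            out.write(b, 0, b.length);
        }

        public int size() { return out.size(); }
    }

    public static void main(String[] args) {
        String term = "\u4e2d\u6587"; // two CJK characters

        // One index picks UTF-16, another picks UTF-8; the writer code is shared.
        CodecOutput utf16Index = new CodecOutput(forCharset(StandardCharsets.UTF_16BE));
        CodecOutput utf8Index = new CodecOutput(forCharset(StandardCharsets.UTF_8));
        utf16Index.writeString(term);
        utf8Index.writeString(term);

        System.out.println("UTF-16 index bytes: " + utf16Index.size());
        System.out.println("UTF-8 index bytes:  " + utf8Index.size());
    }
}
```

The reader side would need the same codec, so the choice would also have to be recorded somewhere in the index itself.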

Can anyone think of a better way to set the character encoding per  
index? Or does anyone have other thoughts on this?

