lucene-dev mailing list archives

From Chris Hostetter
Subject Re: stored fields / unicode compression
Date Fri, 09 Jan 2009 01:26:26 GMT

Catching up on my holiday email, I don't think there were any replies to 
this question yet.  

The low level file formats used by Lucene are an area I don't have 
time/expertise to follow carefully, but if I'm remembering correctly the 
consensus is/was to move more towards pure (byte[] data, int start, int 
end) based APIs for efficiency, with "String" based APIs provided as 
syntactic sugar via a facade, and deprecating the existing "internal" gzip 
compression in favor of similar "external" compression facades.  So 
something like you describe could be done as is using the byte[] 
interfaces *and* be generally useful to others.
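
To make the byte[]-plus-facade idea concrete, here is a minimal sketch. 
All names in it (ByteCodec, IdentityCodec, StringFacade) are hypothetical 
and are not Lucene or Solr APIs, and treating "end" as exclusive is an 
assumption. The codec shown just copies bytes; an SCSU, BOCU-1 or gzip 
codec could presumably plug in behind the same interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical interface: names are illustrative, not Lucene or Solr API.
interface ByteCodec {
    // Encodes data[start, end) into stored bytes (end exclusive is an assumption).
    byte[] encode(byte[] data, int start, int end);

    // Decodes stored[start, end) back into the original bytes.
    byte[] decode(byte[] stored, int start, int end);
}

// Placeholder codec that copies the slice unchanged. A real "external"
// compression codec would implement the same two methods.
final class IdentityCodec implements ByteCodec {
    public byte[] encode(byte[] data, int start, int end) {
        return Arrays.copyOfRange(data, start, end);
    }

    public byte[] decode(byte[] stored, int start, int end) {
        return Arrays.copyOfRange(stored, start, end);
    }
}

// String "syntactic sugar" layered on top of the byte[] API.
final class StringFacade {
    private final ByteCodec codec;

    StringFacade(ByteCodec codec) {
        this.codec = codec;
    }

    // Converts the String to UTF-8 and hands the bytes to the codec.
    byte[] store(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return codec.encode(utf8, 0, utf8.length);
    }

    // Asks the codec for the original bytes and decodes them as UTF-8.
    String load(byte[] stored) {
        byte[] utf8 = codec.decode(stored, 0, stored.length);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Callers that only care about Strings would use something like 
new StringFacade(new IdentityCodec()).store("some title"), while code on 
the hot path could call the codec directly and skip the String conversion.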

Taking a step back to look at the broader picture, this is the kind of 
thing that in Solr could be implemented as a new FieldType.

: Date: Fri, 26 Dec 2008 19:00:11 -0500
: From: Robert Muir
: Subject: stored fields / unicode compression
: Have there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
: stored fields?
: Personally I don't put huge amounts of text in stored fields but these
: encodings/compression work extremely well on short strings like titles, etc.
: Removing the unicode penalty for non-latin text (i.e. cut in half) is
: nothing to sneeze at since with lots of docs my stored fields still become
: pretty huge, biggest part of the index.
: I know I could use one of these schemes right now and store everything as
: bytes... but just thinking it might be something of more general use. The
: GZIP compression that is supported isn't very useful as it typically makes
: short snippets bigger...
: Performance compared to UTF-8 is here... seems like a general win to me (but
: maybe I am missing something)
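
The two size claims in the quoted message (UTF-8 roughly doubling 
non-Latin text, and gzip making short snippets bigger) can be checked 
with the JDK alone. This is only a rough sketch: the class name 
StoredFieldSizes and the sample titles are made up for illustration, and 
SCSU/BOCU-1 are not measured because the JDK does not ship encoders for 
them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for comparing stored sizes of short field values.
final class StoredFieldSizes {

    // Number of bytes the value occupies when encoded as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes after gzip-compressing the UTF-8 form,
    // including the fixed gzip header and trailer.
    static int gzipSize(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        String latin = "Lucene in Action";
        // Cyrillic sample title, written with escapes to avoid source-encoding issues.
        String cyrillic = "\u0412\u043e\u0439\u043d\u0430 \u0438 \u043c\u0438\u0440";
        for (String title : new String[] {latin, cyrillic}) {
            System.out.println(title.length() + " chars, "
                    + utf8Size(title) + " UTF-8 bytes, "
                    + gzipSize(title) + " gzip bytes");
        }
    }
}
```

The Cyrillic title has 11 characters but needs 20 bytes in UTF-8, and for 
both titles the gzip output is larger than the plain UTF-8 bytes, since 
the gzip header and trailer alone add 18 bytes.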

