lucene-dev mailing list archives

From Bernhard Messer <>
Subject Re: Binary fields and data compression
Date Wed, 01 Sep 2004 20:42:54 GMT
Doug Cutting wrote:

> Bernhard Messer wrote:
>> A few months ago there was a very interesting discussion about field 
>> compression and the possibility to store binary field values within a 
>> Lucene document. Regarding this topic, Drew Farris came up with a 
>> patch to add the necessary functionality. I ran all the necessary 
>> tests on his implementation and didn't find a single problem. The 
>> original implementation from Drew could now be enhanced to compress 
>> the binary field data (maybe even the text fields, if they are stored 
>> only) before writing to disk. I made some simple statistical 
>> measurements using the package for data compression. 
>> Enabling it, we could save about 40% of the data when compressing plain 
>> text files with sizes from 1 KB to 4 KB. If there is still some interest, 
>> we could first try to update the patch, because it's outdated due to 
>> several changes within the Fields class. After that, compression could 
>> be added to the updated version of the patch.
> I like this patch and support upgrading it and adding it to Lucene.
Maintaining a single, huge patch implementing all the functionality 
through Bugzilla seems very difficult, so I would suggest splitting the 
whole implementation into perhaps three steps:
1) update the binary field patch and add it to Lucene
2) make FieldsReader and FieldsWriter more readable using private 
static finals, and add compression
3) further thoughts about compressing whole documents instead of 
single fields

> I imagine a public API like:
>   public static final class Store {
>      [ ... ]
>      public static final Store COMPRESS = new Store();
>   }
>   new Field(String, byte[]) // stored, not compressed or indexed
>   new Field(String, byte[], Store)
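The Store class sketched above follows the pre-Java-5 typesafe-enum pattern. A minimal, self-contained sketch of how such constants behave (class and field names are illustrative, not from any released Lucene version):

```java
public class StoreSketch {
    // Typesafe-enum pattern, as in the Store class sketched above.
    // The private constructor guarantees each constant is a distinct singleton.
    public static final class Store {
        private Store() {}
        public static final Store YES = new Store();      // store as-is
        public static final Store COMPRESS = new Store(); // store compressed
    }

    public static void main(String[] args) {
        // Identity comparison is enough to tell the options apart.
        System.out.println(Store.COMPRESS != Store.YES); // prints "true"
    }
}
```

Because the constructor is private, no code outside the class can create further Store instances, so `==` comparisons against the constants are safe.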
> Also, in, perhaps we could replace:
>   String stringValue;
>   Reader readerValue;
>   byte[] binaryValue;
> with:
>   Object value;
> And in and, some package-private 
> constants would make the code more readable, like:
>   static final int FieldWriter.IS_TOKENIZED = 1;
>   static final int FieldWriter.IS_BINARY = 2;
>   static final int FieldWriter.IS_COMPRESSED = 4;
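Such bit flags would be combined with bitwise OR when writing a field's status byte and tested with bitwise AND when reading it back; a minimal sketch, using the constant values from the mail above:

```java
public class FieldBits {
    // Proposed package-private constants (values from the mail above).
    static final int IS_TOKENIZED  = 1;
    static final int IS_BINARY     = 2;
    static final int IS_COMPRESSED = 4;

    public static void main(String[] args) {
        // A compressed binary field: combine flags with OR.
        int bits = IS_BINARY | IS_COMPRESSED;
        // Test individual flags with AND.
        System.out.println((bits & IS_TOKENIZED)  != 0); // prints "false"
        System.out.println((bits & IS_COMPRESSED) != 0); // prints "true"
    }
}
```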
> Note that it makes sense to compress non-binary values.  One could use 
> String.getBytes("UTF-8") and compress that.
I'm totally with you. Compressing string values would make sense once 
the length reaches a certain size (the same goes for byte[]). The 
minimum size a compression candidate has to have is something we still 
need to figure out. During my tests I saw that everything from about 
100 bytes upward is a good candidate for compression. But there is much 
more work to do in that area.
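A rough sketch of what such threshold-guarded compression of a UTF-8 string could look like with java.util.zip (the 100-byte threshold is the figure from the tests above, not a settled value):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressDemo {
    // Hypothetical minimum length below which compression is skipped.
    static final int MIN_COMPRESS_LENGTH = 100;

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // Generous buffer; fine for a sketch, a real impl would loop.
        byte[] buf = new byte[input.length * 2 + 64];
        int len = deflater.deflate(buf);
        deflater.end();
        byte[] out = new byte[len];
        System.arraycopy(buf, 0, out, 0, len);
        return out;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) sb.append("lucene field compression ");
        byte[] raw = sb.toString().getBytes("UTF-8");

        // Only compress values above the threshold.
        byte[] packed = raw.length >= MIN_COMPRESS_LENGTH ? compress(raw) : raw;
        System.out.println(packed.length < raw.length); // true for repetitive text

        // Round-trip check: inflate back to the original bytes.
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        byte[] restored = new byte[raw.length];
        int n = inflater.inflate(restored);
        inflater.end();
        System.out.println(n == raw.length); // prints "true"
    }
}
```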

> I wonder if it might make more sense to compress entire document 
> records, rather than individual fields.  This would probably do better 
> when documents have lots of short text fields, as is not uncommon, and 
> would also minimize the fixed compression/decompression setup costs 
> (i.e., inflator/deflator allocations).  We could instead add a 
> "isCompressed" flag to Document, and then, in Field{Reader,Writer}, 
> store a bit per document indicating whether it is compressed.  
> Document records could first be serialized uncompressed to a buffer 
> which is then compressed and written.  Thoughts?
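Doug's buffer-then-compress idea could be sketched roughly like this; the field serialization here is simplified to writeUTF, the real FieldsWriter on-disk format differs:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.zip.DeflaterOutputStream;

public class DocumentCompressSketch {
    public static void main(String[] args) throws Exception {
        // 1. Serialize the (hypothetical) field values uncompressed to a buffer.
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(record);
        String[] values = {"short title", "author name", "a longer body field ..."};
        out.writeInt(values.length);
        for (String v : values) out.writeUTF(v);
        out.flush();

        // 2. Compress the whole record in one pass, so the deflater
        //    setup cost is paid once per document, not once per field.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(compressed);
        record.writeTo(dos);
        dos.finish();

        System.out.println(compressed.size() > 0); // prints "true"
    }
}
```

One deflater allocation per document is the point: many short fields share a single compression context instead of each paying the fixed setup cost.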
Interesting idea. I think this strongly depends on the fields, the 
options they have, and their actual values. Would it make sense to 
compress a field which is tokenized and indexed but not stored? Maybe 
we could think of some kind of algorithm that checks a document's field 
settings and decides whether it is a candidate for compression. Just a 
thought ;-)
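Such a candidate check might boil down to something as simple as this (a purely hypothetical heuristic, not part of any patch):

```java
public class CompressCandidate {
    // Hypothetical heuristic: only stored values end up on disk as field
    // data, so only those can benefit, and only above a minimum length.
    static boolean isCandidate(boolean stored, int valueLength, int minLength) {
        return stored && valueLength >= minLength;
    }

    public static void main(String[] args) {
        // Tokenized/indexed but not stored: nothing to compress.
        System.out.println(isCandidate(false, 5000, 100)); // prints "false"
        // Stored and long enough: worth compressing.
        System.out.println(isCandidate(true, 2000, 100));  // prints "true"
    }
}
```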

> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

