lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <>
Subject Re: RLE Compressing bit vectors, just toughts
Date Sun, 05 Aug 2007 08:35:19 GMT

>In the latest Matcher patches I forgot to consider your implementation of
>a Matcher on OpenBitSet. 
If I remember well, Matcher based on BitSetIterator (another of Yonik's pearls :) is something
to consider, my Matcher on OpenBitSet is no better than yours.

>As a sort of baseline for good compression you might try and use
>the GZipInputStream in java.util.gzip. I just saw that it even

I believe it is not going to work, java compression support is good for compressing files
or larger compact blocks of bytes, but when you need something a bit more "exotic", eg.  random
access like in this case, static Huffman (or any entropy coder), static dictionary for deflate...
it fails. 

In this concrete case you would need for every single doc id list to initialize new deflater,
what is IMHO prohibitively slow. (skip you mentioned is just scanning byte by byte without
directly repositioning Stream)

it would work for really huge collections, where this constant time factor gets relatively

Also, this way we are not going to reduce the number of VInt decoding  (by far the most popular
method call in Lucene :) 

So far, my best bet would be to do it directly on index, every time you see  a lot of deltas
 == 1, you try to encode them with some 0xFF(run length). On decoding, next() and skipTo()
should not be all that complex (I hope). Must see how it can cooperate with skipping info.

Yahoo! Answers - Got a question? Someone out there knows the answer. Try it

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message