lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: RLE Compressing bit vectors, just toughts
Date Sat, 04 Aug 2007 19:09:18 GMT

Two things:

In the latest Matcher patches I forgot to consider your implementation of
a Matcher on OpenBitSet. In case I don't mention this there, please holler
there when things start moving. My implementation of that in the latest
patches is really no more than a first attempt, and mostly untested.

As a sort of baseline for good compression you might try and use
the GZipInputStream in java.util.gzip. I just saw that it even
has a skip() method over bytes of uncompressed data,
so it should also useful for skip list compression.
I have no idea how fast/slow this class is, but when it works only
slightly better than reasonable, there is probably no need to try
any compression tricks on its input, except for delta
encoding to reduce the total size of the uncompressed data stream.

Paul Elschot

On Saturday 04 August 2007 20:26, eks dev wrote:
> Hi Paul, thanks for feedback
> >I suppose you mean run length encoding? (I missed the first posting about 
> this.)
> You are right, that is what I meant by RLE. This was the first posting. I am 
just trying get some feedback to see if there are some knock-out conditions 
disqualifying  this idea.
> a bit of background, 
> It came up as a side effect after playing with Filter , better said, with 
your Matcher idea. We found some *relatively* clean way to get around BitSet 
- Filter marriage by avoiding Filter completely and making our own 
"SlicingIndexReader extends FilterIndexReader" that has ability to receive  
Matcher(filter) from outside. Also,  FilteredTermDocs implements TermDocs 
inside of it to use this "Matcher" to do skipping.   We could not wait  for  
your nice  code that solves Filter problems  to get on svn. I did not bother 
anybody with this code as I think it solves problem with Filter on 
conceptually soft ground. It provides me with capacity to make different view 
on index subset defined by Matcher, but does not have flexibility your 
approach with Matcher has (eg BooleanQuery.add(Matcher, Occur)...). Also, it 
is fast, as it filters out at "the source", with Scorers totally unaware of 
it.    .... irrelevant, just mentioning it as maybe someone finds this idea
>  useful for something else. 
> Back to the RLE, by doing all this, we came up with one 
SortedIntIntervalListMatcher  as  we have some fields in index that compress 
perfectly using this trick! IntervalList is practically  the same thing as 
RLE, solves  the same problem. So the idea, "can RLE save some ticks/ index 
size without affecting performance in typical non-sorted case", I would say 
yes, but it is good idea to ask for feedback from people more familiar with 
multi interval skip lists and bit level gurus like you and Yonik,   honestly, 
I have no idea what would be relative cost of an extra if(0xFF==b) in tight 
next() and skipTo() methods, as  this determines how big slow down in the 
worst case will be (totally sparse, no RLE "firing"). 
> >You could try and use a VInt flag value (how about 0xFF ?) to start an 
> >run length encoded series of bits. For example 0xFF would be followed by 
> >next delta as a VInt, and by the run length as the next VInt.
> >You might also try and generalize the bytes of VInt to nibbles (half 
> I will see, I need to figure out some experiments that are not as involved 
as implementing it on Lucene index (change index format and whatnot..)
> Anyhow,  it is encouraging  not to have  immediate "sorry, cannot work 
because ...."   from  you or someone else on this list :)
> thanks again for feedback,
> e.
>       ___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
> now.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message