lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <>
Subject Re: [jira] Updated: (LUCENE-584) Decouple Filter from BitSet
Date Thu, 07 Sep 2006 21:30:19 GMT

"What's the point of using a sorted interval list for a category?"

Just terminology first to avoid misunderstanding :), category is "category field" that can
take N valus

Now, the case I am facing goes as follows:

I have category field in 50Mio collection which has more or less uniform distribution of values,

imagine 100 values (terms) each with 0.5Mio documents. when I sort index on it, bit vectors
look like

So, if I cache this in any other representation (Filter usage, no scoring contribution as
typical for categories) this adds up in memory rather quickly. With interval list for Filter/Matcher
I need 2 Ints to store cache entry per value/term and have really fast next()-SkipTo(). In
total 200 Ints for sorted category field with 100 values....

Not an unusual case I guess, a lot of people can sort their indexes, I guess at least.

Of course, it is a bit simplified, as we should not forget dynamic nature of the index. In
real life, you can afford to sort your index once per month or once per week, but small percentage
of updates do not introduce significant problems for interval lists.

Does this make sense to you, or I misread your aproach?

What woud be sexy to do, is to use such representation for postings storage, so each postings
list would have type field that determines optimal bit vector representation for the term
(high level, did not look into it yet in details). So instead of VInts, we could have interval
lists for such cases. This would be  equivalent to RLE  encoding... which can be done fast
I guess. This field type could be determined in some kind of *oprional* post-optimize proces
that scanns postings and tries to find optimal "compresion", only VInt, VInt with rle for
"predominantly sorted" postings....  ANd this is not 
limited to rle only, I could imagine some smart people to come up with even smarter things...

It looks like good idea to me, but I would not be surprised if somebody proves that I missed
the point totaly :)


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message