lucene-dev mailing list archives

From Michael McCandless
Subject Re: Documentation on the new compressed DocIdSet implementations
Date Tue, 17 Sep 2013 21:01:37 GMT
On Tue, Sep 17, 2013 at 1:24 PM, Smiley, David W. wrote:
> Lucene has got some new compressed DocIdSet implementations that are
> technically very interesting and exciting: PForDeltaDocIdSet, WAH8DocIdSet,
> EliasFanoDocIdSet, … any more?  Yet it's difficult (at least for me) to
> understand their pros/cons to know when to pick amongst them.  They all seem
> great yet why do we have 3?  Only one is actually used by Lucene itself —
> WAH8DocIdSet in CachingWrapperFilter.   Javadocs are hit & miss; the JIRA
> issues have lots of fascinating background but it's time consuming to
> distill.  I think it would be very useful to summarily document key
> characteristics on class level javadocs — not so much implementation details
> but information to help a user choose it versus another.  And as a bonus a
> table perhaps showing relative performance characteristics in package-level
> javadocs.
> Related to this is, I'm wondering does it make sense for a codec's postings
> (assuming no doc freq & no positions?) to be implemented as a serialized
> version of one of these compressed doc id sets?  I think it would be really
> great, not just for compression but also because it might support
> Terms.advance() since some of these compressed formats have indexes.

I think it makes sense; there's an issue for it: LUCENE-5052.  Also,
LUCENE-5123 (invert the PostingsFormat writing APIs) should make it
easier, since you can iterate the postings for each term more than
once, e.g. to decide in the first pass whether to encode using a
bitset or not ...
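To make the two-pass idea concrete, here is a rough sketch in plain Java. It is not Lucene code and does not use the PostingsFormat APIs: the class name, the method names, and the size-based heuristic are all hypothetical, and a plain sorted int[] stands in for one term's postings. The first pass measures what delta-encoded vInts would cost, and the second pass writes whichever form is smaller.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

// Hypothetical illustration only; none of these names exist in Lucene.
final class TwoPassPostingsSketch {

    /** Bytes needed to write a non-negative int with 7 payload bits per byte. */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** First pass: total size if the sorted doc IDs were written as vInt deltas. */
    static int deltaEncodedBytes(int[] sortedDocs) {
        int total = 0;
        int prev = 0;
        for (int doc : sortedDocs) {
            total += vIntBytes(doc - prev);
            prev = doc;
        }
        return total;
    }

    /** A fixed bitset over [0, maxDoc) costs one bit per document. */
    static int bitsetBytes(int maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Assumed heuristic: pick the bitset when it is no larger than the deltas. */
    static boolean useBitset(int[] sortedDocs, int maxDoc) {
        return bitsetBytes(maxDoc) <= deltaEncodedBytes(sortedDocs);
    }

    /** Second pass: iterate the same postings again and write the chosen form. */
    static byte[] encode(int[] sortedDocs, int maxDoc) {
        if (useBitset(sortedDocs, maxDoc)) {
            BitSet bits = new BitSet(maxDoc);
            for (int doc : sortedDocs) {
                bits.set(doc);
            }
            // toByteArray() stops at the highest set bit, so this may be
            // shorter than bitsetBytes(maxDoc).
            return bits.toByteArray();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            int delta = doc - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // continuation bit set
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        return out.toByteArray();
    }
}
```

A real codec would also need to record which encoding was chosen for each term so the reader can decode it; that is left out here.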

Mike McCandless
