lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
Date Sun, 14 Jul 2013 20:12:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708113#comment-13708113 ]

Adrien Grand commented on LUCENE-5101:
--------------------------------------

bq. Do WAH8 and PFOR already have an index?

They do, but the index is naive: it is a plain binary search over a subset of the (docID,position)
pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee
O(log(cardinality)) advance performance.
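
To make the idea concrete, here is a minimal sketch of such a sampled index. It is not the actual WAH8 or PFOR code: the class name SampledIndexSketch, the interval parameter, and the plain sorted int[] standing in for the compressed data are all assumptions made for illustration.

```java
/** Hypothetical sketch: a sampled (docID, position) index over a sorted doc ID list. */
class SampledIndexSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;        // stand-in for the compressed stream (assumption)
    private final int[] sampleDocs;  // doc ID of every interval-th entry
    private final int[] samplePos;   // position of that entry in docs
    private int pos = -1;            // position of the current doc, -1 before the first call

    SampledIndexSketch(int[] sortedDocs, int interval) {
        this.docs = sortedDocs;
        int n = (sortedDocs.length + interval - 1) / interval;
        sampleDocs = new int[n];
        samplePos = new int[n];
        for (int i = 0; i < n; i++) {
            samplePos[i] = i * interval;
            sampleDocs[i] = sortedDocs[i * interval];
        }
    }

    /** Returns the first doc ID >= target beyond the current position, or NO_MORE_DOCS. */
    int advance(int target) {
        // Binary search for the last sample whose doc ID is <= target.
        int lo = 0, hi = sampleDocs.length - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampleDocs[mid] <= target) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Never move backwards: start at the later of the sample and the next position.
        int start = pos + 1;
        if (best >= 0 && samplePos[best] > start) {
            start = samplePos[best];
        }
        // Short linear scan from the chosen starting point.
        for (int i = start; i < docs.length; i++) {
            if (docs[i] >= target) {
                pos = i;
                return docs[i];
            }
        }
        pos = docs.length;
        return NO_MORE_DOCS;
    }
}
```

In this sketch the binary search runs over cardinality / interval samples, and the scan that follows touches at most about interval entries, which is where the logarithmic bound comes from.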

bq. Block decoding might still be added to EliasFano, which should improve its nextDoc() performance

The main use-case I see for these sets is to be used as filters. So I think advance() performance
is more important?

bq. The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc()
is less than a factor two worse than PFOR.

Well, the PFOR doc ID set is not tuned either. :-) But I agree this is a good surprise for
the Elias-Fano set. I mean even the WAH8 doc id set should be pretty fast and is still slower
than the Elias-Fano set.

bq. Another surprise is that Elias-Fano is best at advance() among the compressed sets for
some cases. That means that Long.bitCount() is doing well on the upper bits then.

I'm looking forward to the index. :-)

bq. For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode
the inverse set. Could that be done by a common wrapper?

I guess so.
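
A rough sketch of what such a wrapper could look like, assuming the dense set is stored as its sparse complement and inverted on the fly. The SimpleDocIterator interface below is a hypothetical stand-in for DocIdSetIterator, and the other names are made up for illustration as well.

```java
/** Hypothetical minimal iterator, standing in for Lucene's DocIdSetIterator. */
interface SimpleDocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Returns the next doc ID in increasing order, or NO_MORE_DOCS. */
    int nextDoc();
}

/** Iterates over a sorted array; plays the role of the compressed (sparse) set. */
class ArrayDocIterator implements SimpleDocIterator {
    private final int[] docs;
    private int i = -1;

    ArrayDocIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    public int nextDoc() {
        if (i < docs.length) {
            i++;
        }
        return i < docs.length ? docs[i] : NO_MORE_DOCS;
    }
}

/** Wrapper that iterates over every doc in [0, maxDoc) NOT returned by the wrapped iterator. */
class InverseDocIterator implements SimpleDocIterator {
    private final SimpleDocIterator excluded;
    private final int maxDoc;
    private int doc = -1;
    private int nextExcluded;

    InverseDocIterator(SimpleDocIterator excluded, int maxDoc) {
        this.excluded = excluded;
        this.maxDoc = maxDoc;
        this.nextExcluded = excluded.nextDoc();
    }

    @Override
    public int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc;
        }
        doc++;
        // Skip over docs that the wrapped (stored) set contains.
        while (doc < maxDoc && doc == nextExcluded) {
            doc++;
            nextExcluded = excluded.nextDoc();
        }
        if (doc >= maxDoc) {
            doc = NO_MORE_DOCS;
        }
        return doc;
    }
}
```

Since the wrapper only depends on the iterator interface, the same class would work on top of either encoding, which is the point of making it common.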
                
> make it easier to plugin different bitset implementations to CachingWrapperFilter
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-5101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5101
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-5101.patch
>
>
> Currently this is possible, but it's not so friendly:
> {code}
>   protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
>     if (docIdSet == null) {
>       // this is better than returning null, as the nonnull result can be cached
>       return EMPTY_DOCIDSET;
>     } else if (docIdSet.isCacheable()) {
>       return docIdSet;
>     } else {
>       final DocIdSetIterator it = docIdSet.iterator();
>       // null is allowed to be returned by iterator(),
>       // in this case we wrap with the sentinel set,
>       // which is cacheable.
>       if (it == null) {
>         return EMPTY_DOCIDSET;
>       } else {
> /* INTERESTING PART */
>         final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
>         bits.or(it);
>         return bits;
> /* END INTERESTING PART */
>       }
>     }
>   }
> {code}
> Is there any value to having all this other logic in the protected API? It seems like
> something that's not useful for a subclass... Maybe this stuff can become final, and
> "INTERESTING PART" calls a simpler method, something like:
> {code}
> protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
>   final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
>   bits.or(iterator);
>   return bits;
> }
> {code}
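
For illustration, the shape of the proposal (a final method that keeps the null handling, plus a small overridable hook) can be sketched without Lucene at all. Everything below is hypothetical: CachedDocs, CachingFilterSketch and SortedArrayCachingFilter are made-up names, java.util.BitSet stands in for FixedBitSet, and the iterator is assumed to return doc IDs in increasing order.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.PrimitiveIterator;

/** Hypothetical cached representation: just a membership test. */
interface CachedDocs {
    boolean contains(int doc);
}

/** Hypothetical stand-in for the caching filter; not Lucene's CachingWrapperFilter. */
class CachingFilterSketch {
    static final CachedDocs EMPTY = doc -> false;

    /** Final: the null handling is not something a subclass should need to touch. */
    final CachedDocs docIdSetToCache(PrimitiveIterator.OfInt it, int maxDoc) {
        if (it == null) {
            return EMPTY; // cacheable sentinel instead of null
        }
        return cacheImpl(it, maxDoc);
    }

    /** The only hook: choose the in-memory representation. Default is a bit set. */
    protected CachedDocs cacheImpl(PrimitiveIterator.OfInt it, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        while (it.hasNext()) {
            bits.set(it.nextInt());
        }
        return bits::get;
    }
}

/** A subclass that plugs in a different representation by overriding only the hook. */
class SortedArrayCachingFilter extends CachingFilterSketch {
    @Override
    protected CachedDocs cacheImpl(PrimitiveIterator.OfInt it, int maxDoc) {
        int[] buf = new int[16];
        int n = 0;
        while (it.hasNext()) {
            if (n == buf.length) {
                buf = Arrays.copyOf(buf, n * 2);
            }
            buf[n++] = it.nextInt();
        }
        final int[] docs = Arrays.copyOf(buf, n); // sorted, given the ordering assumption
        return doc -> Arrays.binarySearch(docs, doc) >= 0;
    }
}
```

The subclass never sees the null and sentinel logic, which is what the simpler protected method is meant to achieve.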

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

