lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5084) EliasFanoDocIdSet
Date Wed, 03 Jul 2013 09:12:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698767#comment-13698767 ]

Adrien Grand commented on LUCENE-5084:
--------------------------------------

bq. We could, but in which class? For example, in CachingWrapperFilter it might be good to
save memory, so it could be there.

This new doc ID set might be used for other use cases in the future, so maybe we should have
this method on the EliasFanoDocIdSet class?
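
As a rough illustration of what such a size check could look like, here is a minimal sketch
in plain Java. It is not the method from the patch: the class name EliasFanoSizeSketch, the
method names and the minSavingsFactor parameter are all hypothetical, and the size formula is
the usual textbook estimate for Elias-Fano (numValues * L low bits plus about
numValues + upperBound / 2^L high bits, with L = floor(log2(upperBound / numValues))),
ignoring any index or object overhead.

```java
// Hypothetical sketch, not Lucene code: estimates the Elias-Fano footprint
// and compares it with a plain bit set covering [0, upperBound].
class EliasFanoSizeSketch {

    /** L = floor(log2(upperBound / numValues)), or 0 when the set is dense or empty. */
    static int numLowBits(long numValues, long upperBound) {
        if (numValues <= 0) {
            return 0;
        }
        long ratio = upperBound / numValues;
        return ratio == 0 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
    }

    /** Textbook estimate: L low bits per value plus a unary-coded high-bits array. */
    static long estimatedBits(long numValues, long upperBound) {
        int l = numLowBits(numValues, upperBound);
        long lowBits = numValues * l;
        long highBits = numValues + (upperBound >>> l);
        return lowBits + highBits;
    }

    /** A plain bit set needs one bit per possible doc ID in [0, upperBound]. */
    static long bitSetBits(long upperBound) {
        return upperBound + 1;
    }

    /**
     * Assumed policy: prefer Elias-Fano only when it is at least
     * minSavingsFactor times smaller than the bit set.
     */
    static boolean preferEliasFano(long numValues, long upperBound, int minSavingsFactor) {
        return estimatedBits(numValues, upperBound) * minSavingsFactor < bitSetBits(upperBound);
    }

    public static void main(String[] args) {
        System.out.println("sparse: " + preferEliasFano(1_000, 1_000_000, 4));
        System.out.println("dense:  " + preferEliasFano(500_000, 1_000_000, 4));
    }
}
```

A caller such as a filter cache could apply a check like this to the cardinality of the set
it is about to cache, and fall back to a bit set when the estimate shows no real saving.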

bq. Also, would the expected size be the only thing to check for? When decoding speed is also
important, other DocIdSets might be preferable.

Sure, this is something we need to give users control over. For filter caches, it is already
possible to override CachingWrapperFilter.docIdSetToCache to decide whether speed or memory
usage is more important. The decision can even depend on the cardinality of the set to cache
or on its implementation. So I think we just need to provide users with good defaults.

I haven't run performance benchmarks on this set implementation yet, but if it is faster than
the doc ID set iterators of our default postings format, then it is not going to be a bottleneck
and I think it makes sense to use the implementation that saves the most memory. If it is
slower or not sufficiently faster, then maybe other implementations such as kamikaze's
p-for-delta-based doc ID sets (LUCENE-2750) would make more sense as a default.

bq. Can PackedInts.getMutable also be used in a codec?

If the question is whether the PackedInts API can return readers that read directly from an
IndexInput, then yes, it can. But if we want to be able to store high and low bits
contiguously, then these readers are not going to be a good fit.

bq. I considered a decoder that returns ints but that would require a lot more casting in
the decoder.

OK. I just wanted to have your opinion on this; we can keep everything as a long.
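
For readers who have not looked at the patch, the high/low bits split discussed above can be
sketched as follows. This is a simplified, self-contained illustration and not the code from
LUCENE-5084: the class EliasFanoSketch and all of its members are hypothetical names, it uses
two separate long[] arrays rather than contiguous storage, and it only supports sequential
decoding (no skipping, no broadword selection). Everything is kept as a long, as agreed above.

```java
// Hypothetical sketch of Elias-Fano encoding for a sorted sequence of
// non-negative longs. Not the LUCENE-5084 implementation.
class EliasFanoSketch {
    final int numValues;
    final int numLowBits;   // L = floor(log2(upperBound / numValues))
    final long[] lowBits;   // the L low bits of each value, packed back to back
    final long[] highBits;  // value >>> L in unary: i-th value sets bit (high + i)

    EliasFanoSketch(long[] sortedValues, long upperBound) {
        numValues = sortedValues.length;
        long ratio = numValues == 0 ? 0 : upperBound / numValues;
        numLowBits = ratio == 0 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        long totalLow = (long) numValues * numLowBits;
        lowBits = new long[(int) ((totalLow + 63) >>> 6)];
        long totalHigh = numValues + (upperBound >>> numLowBits) + 1;
        highBits = new long[(int) ((totalHigh + 63) >>> 6)];
        long lowMask = (1L << numLowBits) - 1;
        for (int i = 0; i < numValues; i++) {
            long v = sortedValues[i];
            writeLow(i, v & lowMask);
            long pos = (v >>> numLowBits) + i;
            highBits[(int) (pos >>> 6)] |= 1L << (pos & 63);
        }
    }

    private void writeLow(long index, long value) {
        if (numLowBits == 0) return;
        long bitPos = index * numLowBits;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        lowBits[word] |= value << shift;
        if (shift + numLowBits > 64) {           // value straddles two longs
            lowBits[word + 1] |= value >>> (64 - shift);
        }
    }

    private long readLow(long index) {
        if (numLowBits == 0) return 0;
        long bitPos = index * numLowBits;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = lowBits[word] >>> shift;
        if (shift + numLowBits > 64) {
            v |= lowBits[word + 1] << (64 - shift);
        }
        return v & ((1L << numLowBits) - 1);
    }

    /** Next set bit in highBits at or after 'from'; the caller guarantees one exists. */
    private long nextSetBit(long from) {
        int word = (int) (from >>> 6);
        long w = highBits[word] & (-1L << (from & 63));
        while (w == 0) {
            w = highBits[++word];
        }
        return ((long) word << 6) + Long.numberOfTrailingZeros(w);
    }

    /** Sequentially decodes all values, combining high and low parts as longs. */
    long[] decodeAll() {
        long[] out = new long[numValues];
        long pos = -1;
        for (int i = 0; i < numValues; i++) {
            pos = nextSetBit(pos + 1);
            long high = pos - i;                 // undo the unary offset
            out[i] = (high << numLowBits) | readLow(i);
        }
        return out;
    }
}
```

For example, new EliasFanoSketch(new long[] {2, 3, 5, 7, 11, 13, 24}, 24).decodeAll() gives
back the original sequence. A real doc ID set would wrap the decoding loop in an iterator
rather than materializing an array.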

bq. I'll open another issue for broadword bit selection later.

Sounds good! I think backwards iteration and efficient skipping should be done in separate
issues as well. Even without them, this new doc ID set would be a very nice addition.
                
> EliasFanoDocIdSet
> -----------------
>
>                 Key: LUCENE-5084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5084
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Paul Elschot
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-5084.patch
>
>
> DocIdSet in Elias-Fano encoding

