lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5084) EliasFanoDocIdSet
Date Tue, 02 Jul 2013 22:12:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698319#comment-13698319
] 

Paul Elschot commented on LUCENE-5084:
--------------------------------------

bq.  maybe we should have a static utility method to check that so that consumers of this
API can opt for a FixedBitSet if their doc set is going to be dense?

We could, but in which class? For example, in CachingWrapperFilter it might be good to save
memory, so it could be there.
Also, would the expected size be the only thing to check for? When decoding speed is also
important, other DocIdSets might be preferable.
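
To make the size question concrete, here is a rough sketch of what such a static check could look like. The class and method names are hypothetical and not part of the patch. The estimate uses the usual Elias-Fano bound of about 2 + log2(upperBound/numValues) bits per value, ignores rounding to whole longs and any index overhead, and assumes doc-id-sized inputs so the arithmetic does not overflow. It says nothing about decoding speed.

```java
// Hypothetical helper, not from the LUCENE-5084 patch: compares an approximate
// Elias-Fano size against a plain bit set over the same range of doc ids.
final class DocIdSetSizeSketch {
    private DocIdSetSizeSketch() {}

    /** floor(log2(upperBound / numValues)), or 0 when the set is very dense. */
    static int numLowBits(long numValues, long upperBound) {
        if (numValues <= 0 || upperBound < 0) {
            throw new IllegalArgumentException("need numValues > 0 and upperBound >= 0");
        }
        long ratio = upperBound / numValues;
        return ratio == 0 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
    }

    /** Approximate number of bits used by an Elias-Fano encoding. */
    static long eliasFanoBits(long numValues, long upperBound) {
        int l = numLowBits(numValues, upperBound);
        long lowBits = numValues * l;
        // Unary part: one 1-bit per value plus one 0-bit per bucket of high bits.
        long highBits = numValues + (upperBound >>> l) + 1;
        return lowBits + highBits;
    }

    /** One bit per possible doc id in [0, upperBound]. */
    static long bitSetBits(long upperBound) {
        return upperBound + 1;
    }

    /** True when the Elias-Fano estimate is smaller than the bit set. */
    static boolean eliasFanoIsSmaller(long numValues, long upperBound) {
        return eliasFanoBits(numValues, upperBound) < bitSetBits(upperBound);
    }
}
```

With these assumptions a sparse set (1000 docs out of 1000000) favours Elias-Fano, while a set containing half of all docs favours the bit set.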


bq.  the ceil of the log in base 2 is computed through a loop
numberOfLeadingZeros is indeed better than a loop. We need the Long variant here.
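
For illustration, a minimal sketch of both variants for x >= 1. The class and method names are made up for this comment; only Long.numberOfLeadingZeros is the real JDK call.

```java
// Illustrative only: ceil(log2(x)) for x >= 1, first with a loop and then
// with Long.numberOfLeadingZeros.
final class Log2Sketch {
    private Log2Sketch() {}

    /** Loop variant: smallest 'bits' such that 2^bits >= x. */
    static int ceilLog2Loop(long x) {
        if (x < 1) {
            throw new IllegalArgumentException("x must be >= 1");
        }
        int bits = 0;
        while (bits < 63 && (1L << bits) < x) {
            bits++;
        }
        return bits;
    }

    /** Same result without a loop: the bit length of (x - 1). */
    static int ceilLog2(long x) {
        if (x < 1) {
            throw new IllegalArgumentException("x must be >= 1");
        }
        // For x == 1, numberOfLeadingZeros(0) is 64, so the result is 0.
        return Long.SIZE - Long.numberOfLeadingZeros(x - 1);
    }
}
```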

bq. use PackedInts.getMutable to store the low-order bits instead of a raw long[]
Can PackedInts.getMutable also be used in a codec? Longs are needed for the high bits, see
below, and the high and low bits can be conveniently stored next to each other in an index.

bq.  shouldn't the iterator's getCost method return efDecoder.numValues instead of efEncoder.numValues?
Yes.

bq. Maybe we could just support the encoding of monotonically increasing sequences of ints
to make things simpler?

I considered a decoder that returns ints, but that would require a lot more casting in the
decoder.
Decoding the unary encoded high bits is best done on longs, so mixing longs and ints in the
encoder is not really an option.
We could pass the actual NO_MORE_VALUES to be used as an argument to the decoder, would that
help?

As to why decoding the unary encoded high bits is best done on longs, see Algorithm 2 in "Broadword
Implementation of Rank/Select Queries", Sebastiano Vigna, January 30, 2012, http://vigna.di.unimi.it/ftp/papers/Broadword.pdf.
I also have an initial Java implementation of that, but it is not used here yet. There are
only a few comments in the code here saying that it might be used. I'll open another issue for broadword
bit selection later.
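
As a rough illustration of why whole longs are convenient here, the sketch below decodes unary encoded high bits one 64-bit word at a time. It is not the broadword selection from the paper and not the code in the patch: it is a plain scan with Long.numberOfTrailingZeros, and the class name and bit layout (value i sets bit highValues[i] + i) are assumptions made for the example.

```java
// Illustrative only: unary coding of non-decreasing high-bit values in a
// long[], decoded word by word. Not Vigna's Algorithm 2.
final class UnaryHighBitsSketch {
    private UnaryHighBitsSketch() {}

    /** Value i sets the bit at position highValues[i] + i. */
    static long[] encode(long[] highValues) {
        int n = highValues.length;
        long maxHigh = n == 0 ? 0 : highValues[n - 1];
        // Highest position is maxHigh + n - 1, so maxHigh + n bits are needed.
        long[] words = new long[(int) ((maxHigh + n + 63) >>> 6)];
        for (int i = 0; i < n; i++) {
            long pos = highValues[i] + i;
            words[(int) (pos >>> 6)] |= 1L << (int) (pos & 63);
        }
        return words;
    }

    /** The k-th set bit at position p decodes to the high value p - k. */
    static long[] decode(long[] words, int numValues) {
        long[] out = new long[numValues];
        int k = 0;
        for (int w = 0; w < words.length && k < numValues; w++) {
            long word = words[w];
            while (word != 0 && k < numValues) {
                int bit = Long.numberOfTrailingZeros(word);
                long pos = ((long) w << 6) + bit;
                out[k] = pos - k;
                k++;
                word &= word - 1; // clear the lowest set bit
            }
        }
        return out;
    }
}
```

Each step of the inner loop works on a full long, which is where an int-based decoder would need the extra casts.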




                
> EliasFanoDocIdSet
> -----------------
>
>                 Key: LUCENE-5084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5084
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Paul Elschot
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-5084.patch
>
>
> DocIdSet in Elias-Fano encoding

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

