lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-5084) EliasFanoDocIdSet
Date Tue, 02 Jul 2013 19:31:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698122#comment-13698122
] 

Adrien Grand edited comment on LUCENE-5084 at 7/2/13 7:29 PM:
--------------------------------------------------------------

I had a deeper look at the patch, here are some thoughts about it:
 - the patch looks great, really, I especially like all the integrity checks!
 - after having read more of it, I think moving it to oal.util.packed makes perfect sense
 - the documentation mentions the fact that a FixedBitSet will be more storage efficient if
{{numValues >= (upperBound/3)}}, maybe we should have a static utility method to check
that so that consumers of this API can opt for a FixedBitSet if their doc set is going to
be dense?
 - in EliasFanoEncoder constructor, the ceil of the log in base 2 is computed through a loop,
can we use Integer.numberOfLeadingZeros instead (as explained in http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#numberOfLeadingZeros(int)
)
 - I think it would make sense to use PackedInts.getMutable to store the low-order bits instead
of a raw long[] (this abstraction will hide the complexity of the bit packing or even use
a more appropriate structure, eg. if the number of bits required is 8, it will use a byte[])
 - I'm not sure but shouldn't the iterator's getCost method return efDecoder.numValues instead
of efEncoder.numValues?

Even though storing longs is more general, it forces us to use a different value to encode
NO_MORE_DOCS which adds a little complexity to the DocIdSet implementation. Maybe we could
just support the encoding of monotonically increasing sequences of ints to make things simpler?
(just thinking out loud).
                
      was (Author: jpountz):
    I had a more deeper look at the patch, here are some thoughts about it:
 - the patch looks great, really, I especially like all the integrity checks!
 - after having read more of it, I think moving it to oal.util.packed makes perfect sense
 - the documentation mentions the fact that a FixedBitSet will be more storage efficient if
{{numValues >= (upperBound/3)}}, maybe we should have a static utility method to check
that so that consumers of this API can opt for a FixedBitSet if their doc set is going to
be dense?
 - in EliasFanoEncoder constructor, the ceil of the log in base 2 is computed through a loop,
can we use Integer.numberOfLeadingZeros instead (as explained in http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#numberOfLeadingZeros(int)
)
 - I think it would make sense to use PackedInts.getMutable to store the low-order bits instead
of a raw long[] (this abstraction will hide the complexity of the bit packing or even use
a more appropriate structure, eg. if the number of bits required is 8, it will use a byte[])
 - I'm not sure but shouldn't the iterator's getCost method return efDecoder.numValues instead
of efEncoder.numValues?

Even though storing longs is more general, it forces us to use a different value to encode
NO_MORE_DOCS which adds a little complexity to the DocIdSet implementation. Maybe we could
just support the encoding of monotonically increasing sequences of ints to make things simpler?
(just thinking out loud).
                  
> EliasFanoDocIdSet
> -----------------
>
>                 Key: LUCENE-5084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5084
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Paul Elschot
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-5084.patch
>
>
> DocIdSet in Elias-Fano encoding

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message