lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-7304) Doc values based block join implementation
Date Thu, 02 Jun 2016 06:51:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311833#comment-15311833 ]

Paul Elschot edited comment on LUCENE-7304 at 6/2/16 6:51 AM:
--------------------------------------------------------------

Instead of the patch, it might be simpler to let EliasFanoDocIdSet extend BitSet,
even though it cannot implement MutableBits.
There is a dilemma here: either introduce DocBlocksIterator, or not implement MutableBits.

The question is which one would be preferable in the long term for the block join queries:
DocBlocksIterator or BitSet?
DocBlocksIterator is read-only and might involve a little overhead.
BitSet implements mutability but that is not needed for the block join queries.




was (Author: paul.elschot@xs4all.nl):
It might be simpler to try and let EliasFanoDocIdSet extend from BitSet, even though it cannot
implement MutableBits.
There is a dilemma here: either introduce DocBlocksIterator, or not implement MutableBits.

The question is which one would be preferable in the long term for the block join queries:
DocBlocksIterator or BitSet?
DocBlocksIterator is read only and might involve a little overhead.
BitSet implements mutability but that is not needed for the block join queries.



> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset to find the previous parent doc while
advancing the doc id iterator. On large indices these bitsets can consume large amounts of
JVM heap space. Also, due to the nature of how these bitsets are set, the 'FixedBitSet'
implementation is typically used.
> The idea I had was to replace the bitset usage by a numeric doc values field that stores
offsets. Each child doc stores how many docids it is from its parent doc and each parent stores
how many docids it is apart from its first child. At query time this information can be used
to perform the block join.
> I think another benefit of this approach is that external tools can now easily determine
if a doc is part of a block of documents and perhaps this also helps index time sorting?
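
The offset idea described above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not code from any of the attached patches: the
class BlockOffsets and all of its methods are hypothetical, plain int arrays stand in
for the numeric doc values field, java.util.BitSet stands in for Lucene's own bitset
classes, and children are assumed to be indexed directly before their parent.

```java
import java.util.BitSet;

// Hypothetical sketch: per-document offsets in place of a parent bitset.
final class BlockOffsets {
    // For a child: parentDoc - childDoc. For a parent: 0.
    private final int[] toParent;
    // For a parent: parentDoc - firstChildDoc (0 if it has no children).
    private final int[] toFirstChild;

    private BlockOffsets(int[] toParent, int[] toFirstChild) {
        this.toParent = toParent;
        this.toFirstChild = toFirstChild;
    }

    // Builds offsets for consecutive blocks; each argument is the number of
    // children in one block, and the parent is the last doc of its block.
    static BlockOffsets fromBlockSizes(int... childrenPerBlock) {
        int maxDoc = 0;
        for (int n : childrenPerBlock) {
            maxDoc += n + 1;
        }
        int[] toParent = new int[maxDoc];
        int[] toFirstChild = new int[maxDoc];
        int doc = 0;
        for (int n : childrenPerBlock) {
            int parent = doc + n;
            for (int child = doc; child < parent; child++) {
                toParent[child] = parent - child;
            }
            toFirstChild[parent] = n;
            doc = parent + 1;
        }
        return new BlockOffsets(toParent, toFirstChild);
    }

    boolean isParent(int doc) {
        return toParent[doc] == 0;
    }

    // Parent of the block that contains doc (doc itself if it is a parent).
    int parentOf(int doc) {
        return doc + toParent[doc];
    }

    // First doc of the parent's block (the parent itself if it has no children).
    int firstChildOf(int parentDoc) {
        return parentDoc - toFirstChild[parentDoc];
    }

    // The bitset-based lookup, for comparison: the block starts right after
    // the previous parent. previousSetBit(-1) returns -1, so doc 0 works too.
    static int firstChildViaBitSet(BitSet parents, int parentDoc) {
        return parents.previousSetBit(parentDoc - 1) + 1;
    }
}
```

With blocks of 2, 0 and 3 children, docs 2, 3 and 7 are the parents. The offset lookups
and the bitset lookup then agree on where each block starts, but the offset version
needs no per-segment bitset on the heap.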



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

