lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7304) Doc values based block join implementation
Date Fri, 03 Jun 2016 22:29:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314992#comment-15314992
] 

Paul Elschot commented on LUCENE-7304:
--------------------------------------

This will take a some time, I'll give a try, slowly.

I intend to make an EliasFanoDocIdSet that implements BitSet.
I think it makes sense to try and use this as a starting point for a sparse doc values implementation,
so for now I'm not opening a new issue.
Unlike normal doc values, this would allow an advance(target) implementation.

Meanwhile I realized that a doc values implementation will also have to deal with MutableBits.

bq. (... Typically it tends to be on the dense side).

Is there is a typical document block size these days?
For less than 7, an EliasFano based implementation does not really make sense, a FixedBitSet
is better there.
The bigger the block size gets than that, the more EliasFano makes sense.

For nested blocks, EliasFano can be used hierarchically, at a higher level the value in the
dictionary can be the index of the dictionary at the lower level.
Anyway, at any level there is always the possibility to use FixedBitSet or another BitSet
implementation.


> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous parent doc during
advancing the doc id iterator. On large indices these bitsets can consume large amounts of
jvm heap space.  Also typically due the nature how these bitsets are set, the 'FixedBitSet'
implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field that stores
offsets. Each child doc stores how many docids it is from its parent doc and each parent stores
how many docids it is apart from its first child. At query time this information can be used
to perform the block join.
> I think another benefit of this approach is that external tools can now easily determine
if a doc is part of a block of documents and perhaps this also helps index time sorting?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message