lucene-dev mailing list archives

From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7304) Doc values based block join implementation
Date Mon, 06 Jun 2016 20:04:21 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-7304:
---------------------------------
    Attachment: LUCENE-7304-20160606.patch

Patch of 6 June 2016.
This is the EliasFano code from LUCENE-5627, moved into core.

This has EliasFanoSequence implemented as EliasFanoBytes and as EliasFanoLongs, and an encoder
and a decoder for these.
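
For readers unfamiliar with the encoding, here is a minimal sketch of the general Elias-Fano idea: each value is split into low bits stored verbatim and high bits stored in unary. It is not the code from the patch; the class name EliasFanoSketch and its methods are hypothetical, and java.util.BitSet is used only to keep the example short.

```java
import java.util.BitSet;

/** Hypothetical illustration of Elias-Fano encoding; not the classes in the patch. */
final class EliasFanoSketch {
    final int numValues;
    final int numLowBits;
    final BitSet lowBits = new BitSet();   // low bits of each value, stored verbatim
    final BitSet highBits = new BitSet();  // high bits of each value, unary coded

    /** sortedValues must be non-decreasing, non-negative and not larger than upperBound. */
    EliasFanoSketch(long[] sortedValues, long upperBound) {
        numValues = sortedValues.length;
        long ratio = numValues == 0 ? 0 : upperBound / numValues;
        // floor(log2(upperBound / numValues)), or 0 when the ratio is below 1
        numLowBits = ratio < 1 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        for (int i = 0; i < numValues; i++) {
            long v = sortedValues[i];
            for (int b = 0; b < numLowBits; b++) {
                if (((v >>> b) & 1L) != 0) {
                    lowBits.set(i * numLowBits + b);
                }
            }
            // The i-th value sets the bit at position (high part + i).
            highBits.set((int) (v >>> numLowBits) + i);
        }
    }

    /** Decodes all values by walking the set bits of the high-bits bitmap. */
    long[] decode() {
        long[] out = new long[numValues];
        int pos = -1;
        for (int i = 0; i < numValues; i++) {
            pos = highBits.nextSetBit(pos + 1);
            long high = pos - i;
            long low = 0;
            for (int b = 0; b < numLowBits; b++) {
                if (lowBits.get(i * numLowBits + b)) {
                    low |= 1L << b;
                }
            }
            out[i] = (high << numLowBits) | low;
        }
        return out;
    }
}
```

The sketch only supports sequential decoding and small inputs; an encoder meant for real use would pack the bits into long or byte arrays and support skipping.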

The EliasFanoDocIdSet uses an EliasFanoLongs, except when the set is dense; in that case it uses
a FixedBitSet.
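
To illustrate why a dense set is better served by a plain bit set, the following sketch compares rough size estimates for the two representations. The formulas and the decision rule are assumptions made for illustration; the threshold actually used in the patch may differ, and the names here are hypothetical.

```java
/** Hypothetical size comparison; not the decision rule used in the patch. */
final class DenseOrSparseSketch {
    /** Approximate number of bits an Elias-Fano encoding of numValues values up to upperBound needs. */
    static long eliasFanoBits(long numValues, long upperBound) {
        if (numValues == 0) {
            return 0;
        }
        long ratio = upperBound / numValues;
        int lowBits = ratio < 1 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        // Unary part: one set bit per value plus at most (upperBound >>> lowBits) zero bits.
        long highBits = numValues + (upperBound >>> lowBits) + 1;
        return numValues * lowBits + highBits;
    }

    /** A plain bit set needs one bit per possible doc id in [0, upperBound]. */
    static long fixedBitSetBits(long upperBound) {
        return upperBound + 1;
    }

    /** Assumed rule: use the bit set whenever Elias-Fano would not be smaller. */
    static boolean preferBitSet(long numValues, long upperBound) {
        return eliasFanoBits(numValues, upperBound) >= fixedBitSetBits(upperBound);
    }
}
```

This compares raw size only. An implementation may also weigh access speed, and so switch to the bit set well before the sizes are equal.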

I added a getBitSet() method in this EliasFanoDocIdSet.

I also added the test cases from LUCENE-5627, but I have not yet added a test for the getBitSet()
method. It already works as a DocIdSet, so using it as a BitSet should be no problem.

EliasFanoDocIdSet could also be implemented on EliasFanoBytes, and it should be doable to
put that in an index. In LUCENE-5627, EliasFanoBytes is already used as a payload.


> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, LUCENE-7304-20160606.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous parent doc while advancing the doc id iterator. On large indices these bitsets can consume large amounts of JVM heap space. Also, due to the nature of how these bitsets are set, the 'FixedBitSet' implementation is typically used.
> The idea I had was to replace the bitset usage by a numeric doc values field that stores offsets. Each child doc stores how many docids it is from its parent doc and each parent stores how many docids it is apart from its first child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now easily determine if a doc is part of a block of documents and perhaps this also helps index time sorting?
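
The offset scheme in the issue description can be sketched with a plain array standing in for the numeric doc values field. This is an illustration under stated assumptions, not the implementation in any of the attached patches: it assumes each block stores its children first and its parent last, and it assumes a sign convention (positive for a child, zero or negative for a parent) to tell the two apart. All names are hypothetical.

```java
/** Hypothetical offset-based block join lookup; a long[] stands in for the doc values field. */
final class OffsetBlockJoinSketch {
    /** Builds offsets for consecutive blocks; childCounts[i] is the number of children in block i. */
    static long[] buildOffsets(int[] childCounts) {
        int total = 0;
        for (int c : childCounts) {
            total += c + 1;
        }
        long[] offsets = new long[total];
        int doc = 0;
        for (int c : childCounts) {
            int parent = doc + c; // assumed layout: children first, parent last
            for (int child = doc; child < parent; child++) {
                offsets[child] = parent - child; // positive: distance forward to the parent
            }
            offsets[parent] = -c; // zero or negative: distance back to the first child
            doc = parent + 1;
        }
        return offsets;
    }

    static boolean isParent(long[] offsets, int doc) {
        return offsets[doc] <= 0;
    }

    /** Returns the parent of doc, or doc itself when it is a parent. */
    static int parentOf(long[] offsets, int doc) {
        return isParent(offsets, doc) ? doc : (int) (doc + offsets[doc]);
    }

    /** Returns the first child of a parent, or the parent itself when it has no children. */
    static int firstChildOf(long[] offsets, int parent) {
        if (!isParent(offsets, parent)) {
            throw new IllegalArgumentException("doc " + parent + " is not a parent");
        }
        return (int) (parent + offsets[parent]);
    }

    /** Returns the parent of the preceding block, or -1 for the first block. */
    static int previousParentOf(long[] offsets, int parent) {
        return firstChildOf(offsets, parent) - 1;
    }
}
```

With block sizes {2, 0, 3}, docs 0 and 1 are children of parent 2, doc 3 is a childless parent, and docs 4 to 6 are children of parent 7. The previous-parent lookup that the bitset currently provides then becomes a single offset read per parent.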



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

