lucene-dev mailing list archives

From "Martijn van Groningen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7304) Doc values based block join implementation
Date Tue, 07 Jun 2016 13:19:21 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE_7304.patch

Changed the block join query to only require that parent docs store how far away their first
child doc is (in docids).

This reduces the amount of information that has to be stored in the doc values offset field,
and these parent offsets compress better than the previous offset values (which were composed
of more information).
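
As a rough illustration of the parent-only scheme, the sketch below models the offset field as a plain long[] and assumes the usual block layout in which children are indexed immediately before their parent. The class and method names are hypothetical and are not taken from the attached patch.

```java
import java.util.Arrays;

// Hypothetical sketch, not the code from LUCENE_7304.patch.
// Assumes each block is laid out as [child, ..., child, parent].
class ParentOffsetSketch {

    // Builds one offset value per doc for a sequence of consecutive blocks.
    // blockSizes[i] is the number of docs in block i, parent included.
    // Only parent docs get a value: the distance (in docids) to their first child.
    // Child slots stay 0 here; in a real doc values field they would simply have no value.
    static long[] buildOffsets(int[] blockSizes) {
        int total = Arrays.stream(blockSizes).sum();
        long[] offsets = new long[total];
        int docId = 0;
        for (int size : blockSizes) {
            int parent = docId + size - 1; // parent is the last doc of its block
            offsets[parent] = size - 1;    // 0 means the parent has no children
            docId += size;
        }
        return offsets;
    }

    // Returns the docid of the first child of parentDoc.
    // If the parent has no children this is parentDoc itself.
    static int firstChild(long[] offsets, int parentDoc) {
        return parentDoc - (int) offsets[parentDoc];
    }

    public static void main(String[] args) {
        // Three blocks: docs 0-2 (parent 2), doc 3 (childless parent), docs 4-7 (parent 7).
        long[] offsets = buildOffsets(new int[] {3, 1, 4});
        System.out.println(Arrays.toString(offsets));
        System.out.println("first child of doc 7: " + firstChild(offsets, 7));
    }
}
```

Given a parent docid, the children are then the range from firstChild(parent) up to, but not including, the parent itself.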

I tested this patch out on a test data set (https://archive.org/download/stackexchange/english.stackexchange.com.7z).
I extracted the questions, answers and comments and indexed each question with its answers
and related comments as a hierarchical block of documents. In total 745252 docs were indexed.
The size of the doc values offset field was 839592 bytes. 

After that I ran a query that selects all questions that have answers with comments (questions
-> answers -> comments) using both the current block join and the doc values block join. The
current block join used 186768 bytes of JVM heap for bitsets, and the doc values block join used
1132 bytes of JVM heap for references to the offset doc values field.

So the doc values approach used roughly 4.5 times more RAM in total (assuming the OS caches
the offset field), while its JVM heap footprint was roughly 165 times smaller.
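
For anyone who wants to check the arithmetic, the sketch below recomputes both ratios from the byte counts reported in this message. It assumes that "total RAM" means the offset field size plus the heap references; the class name is made up for illustration.

```java
// Recomputes the two ratios from the byte counts reported in this message.
class MemoryRatioSketch {
    static final long OFFSET_FIELD_BYTES = 839_592L;  // size of the doc values offset field
    static final long DOC_VALUES_HEAP_BYTES = 1_132L; // heap used by the doc values block join
    static final long BITSET_HEAP_BYTES = 186_768L;   // heap used by the current block join

    // Assumption: total RAM for the doc values approach = offset field
    // (held in the OS cache) + the heap references to it.
    static double totalRamRatio() {
        return (double) (OFFSET_FIELD_BYTES + DOC_VALUES_HEAP_BYTES) / BITSET_HEAP_BYTES;
    }

    // How many times smaller the heap footprint of the doc values approach is.
    static double heapRatio() {
        return (double) BITSET_HEAP_BYTES / DOC_VALUES_HEAP_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("total RAM ratio: " + totalRamRatio());
        System.out.println("heap ratio: " + heapRatio());
    }
}
```

Both values come out close to the rounded figures quoted above.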

> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, LUCENE-7304-20160606.patch,
LUCENE_7304.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset to find the previous parent doc while
advancing the doc id iterator. On large indices these bitsets can consume large amounts of
JVM heap space. Also, due to the nature of how these bitsets are set, the 'FixedBitSet'
implementation is typically used.
> The idea I had was to replace the bitset usage by a numeric doc values field that stores
offsets. Each child doc stores how many docids it is from its parent doc and each parent stores
how many docids it is apart from its first child. At query time this information can be used
to perform the block join.
> I think another benefit of this approach is that external tools can now easily determine
if a doc is part of a block of documents and perhaps this also helps index time sorting?
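
To make the comparison in the description concrete, here is a minimal sketch of both lookups. It uses java.util.BitSet as a stand-in for Lucene's FixedBitSet and a plain long[] in place of the numeric doc values field, and it assumes children are indexed immediately before their parent. All names are hypothetical. How parents are told apart from children in the offset field is not described here and is not modeled.

```java
import java.util.BitSet;

// Hypothetical sketch; java.util.BitSet stands in for Lucene's FixedBitSet.
// Assumes each block is laid out as [child, ..., child, parent].
class BlockLookupSketch {

    // Bitset approach: one bit per doc in the index, set for parent docs.
    // The block of parentDoc starts right after the previous parent,
    // or at doc 0 when there is no previous parent (previousSetBit returns -1).
    static int firstChildViaBitSet(BitSet parents, int parentDoc) {
        return parents.previousSetBit(parentDoc - 1) + 1;
    }

    // Offset approach: each child stores how many docids it is from its parent.
    static int parentViaOffset(long[] offsets, int childDoc) {
        return childDoc + (int) offsets[childDoc];
    }

    // Offset approach: each parent stores how many docids it is from its first child.
    static int firstChildViaOffset(long[] offsets, int parentDoc) {
        return parentDoc - (int) offsets[parentDoc];
    }

    public static void main(String[] args) {
        // Two blocks: docs 0-2 (parent 2) and docs 3-6 (parent 6).
        BitSet parents = new BitSet(7);
        parents.set(2);
        parents.set(6);
        long[] offsets = {2, 1, 2, 3, 2, 1, 3};

        System.out.println(firstChildViaBitSet(parents, 6));
        System.out.println(firstChildViaOffset(offsets, 6));
        System.out.println(parentViaOffset(offsets, 4));
    }
}
```

The bitset needs one bit per document on the heap, whereas the offsets live in a doc values field that can be read from disk or the OS cache.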



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

