lucene-dev mailing list archives

From Dale Richardson <tigerqu...@outlook.com>
Subject Re: nested documents performance anomaly
Date Sun, 14 Apr 2019 10:58:35 GMT
Hi Roi,
My understanding of how the nested relationship is implemented in Lucene is that the child
documents are physically stored next to their parent document, as one contiguous block in the
same index segment.  For normal queries, which index segment a document is stored in is completely
transparent to the query result, but the block join operator used for parent-child joins takes
advantage of this low-level detail to provide very fast joins between parent and child
documents.  The trade-off for this technique is that the whole parent-child block needs to be
re-indexed when any part of the parent-child relationship changes.
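
As a toy illustration of why this layout makes the join cheap, here is a minimal Python sketch.
It is not Lucene code.  It assumes each block is laid out with the children first and the parent
last, and that the positions of the parents are known; the names build_blocks and parent_of are
made up for this example.

```python
from bisect import bisect_left


def build_blocks(families):
    """Lay out each (parent, children) family as one contiguous block.

    Assumption for this sketch: children come first, the parent comes last.
    Returns the flat document list and the sorted positions of the parents.
    """
    docs = []
    parent_ids = []
    for parent, children in families:
        docs.extend(children)
        docs.append(parent)
        parent_ids.append(len(docs) - 1)
    return docs, parent_ids


def parent_of(doc_id, parent_ids):
    """Return the position of the first parent at or after doc_id.

    Because blocks are contiguous, no lookup table or join key is needed.
    """
    return parent_ids[bisect_left(parent_ids, doc_id)]


docs, parent_ids = build_blocks([("P0", ["c0a", "c0b"]), ("P1", ["c1a"])])
print(docs[parent_of(0, parent_ids)])  # parent of "c0a"
print(docs[parent_of(3, parent_ids)])  # parent of "c1a"
```

The point of the sketch is that the join depends only on document order, which is also why a
block cannot be patched in place: changing one child means writing the whole block again.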

I suspect that if you are writing all the child documents for a parent at once, you are helpfully
batching up all updates to a single index segment into a single update, with a resulting
increase in speed.

The constraints that apply in return for this speed boost are that you must have all the child
documents ready to write in one go, and the index updates are likely done in a single transaction
for each parent (i.e. all or none).  I suspect (but have not tested this) that indexing
1000 child documents for each of 1000 parent documents one document at a time would actually be
slower than just indexing 1 million flat documents one document at a time.
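
To make the "all children in one go" constraint concrete, here is a small Python sketch that
builds the two kinds of update payload.  It only constructs JSON and does not talk to Solr.  The
_childDocuments_ key is, as far as I know, what Solr's JSON update format accepts for nested
children; the id and type_s field names and the helper functions are assumptions made up for this
example.

```python
import json


def nested_batch(num_parents, children_per_parent):
    """Build parents that each carry all of their children in one document."""
    batch = []
    for p in range(num_parents):
        children = [
            {"id": f"p{p}-c{c}", "type_s": "child"}  # hypothetical fields
            for c in range(children_per_parent)
        ]
        batch.append(
            {"id": f"p{p}", "type_s": "parent", "_childDocuments_": children}
        )
    return batch


def flat_batch(num_docs):
    """Build the same kind of documents with no parent-child structure."""
    return [{"id": f"d{i}", "type_s": "child"} for i in range(num_docs)]


def count_index_docs(batch):
    """Count how many separate documents the index ends up holding."""
    return sum(1 + len(doc.get("_childDocuments_", [])) for doc in batch)


nested = nested_batch(2, 3)
flat = flat_batch(6)
payload = json.dumps(nested)  # request body for a JSON update, not sent here
print(len(nested), count_index_docs(nested))
print(len(flat), count_index_docs(flat))
```

In the nested form every parent is one unit of work that already contains its whole block, so a
child cannot be added later without sending the parent and all of its siblings again.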

I hope this increases your understanding of the situation.

Regards,
Dale.
________________________________
From: Roi Wexler <Roi.Wexler@wdc.com>
Sent: Sunday, 14 April 2019 6:59 AM
To: dev@lucene.apache.org
Subject: nested documents performance anomaly


Hi,
We're in the process of testing Solr for its indexing speed, which is very important to our
application.
We've witnessed strange behavior that we wish to understand before using it.
When we indexed 1M docs it took about 63 seconds, but when we indexed the same documents
nested as 1000 parents with 1000 child documents each, it took only 27 seconds.

We know that Lucene doesn't support nested documents, since it has a flat object model, and we
do see that it in fact indexes each of the child documents as a separate document.

Our tests show that we get the same results whether we index all documents flat (without
children) or as 1000 parents with 1000 nested documents each.

Are we missing something here?
Why does it behave like that?
What kind of constraints do child documents have, or what is the price we pay to get this
better indexing speed?
We're trying to establish whether this is a valid way to get better indexing performance.

Any help will be appreciated.


