lucene-dev mailing list archives

From "Christoph Kaser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-7861) Hidden assumption that return value of IndexSearcher.slices is an array of continuous sequential slices of the index
Date Thu, 01 Jun 2017 10:40:04 GMT
Christoph Kaser created LUCENE-7861:
---------------------------------------

             Summary: Hidden assumption that return value of IndexSearcher.slices is an array of continuous sequential slices of the index
                 Key: LUCENE-7861
                 URL: https://issues.apache.org/jira/browse/LUCENE-7861
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: 6.5.1, 6.0
            Reporter: Christoph Kaser


The IndexSearcher method
{code:java}protected LeafSlice[] slices(List<LeafReaderContext> leaves){code}
can be overridden to customize how the index is searched with multiple threads. However, the
IndexSearcher assumes that the result is an ordered array of continuous slices of the index. If
the result is "interleaved" or unordered, searchAfter may skip results.

The issue seems to be how searchAfter works vs how TopDocs.merge works:

searchAfter skips every document with a higher score than the "after" document. In case of
equal scores, it compares document ids and skips every document whose id is less than or equal
to that of the "after" document (see PagingFieldCollector).

TopDocs.merge uses the score to determine which hits should be part of the merged TopDocs.
In case of equal scores, it breaks ties by shard index, which corresponds to the index of the
slice the IndexSearcher uses (see ScoreMergeSortQueue.lessThan).

So if the slices (the "shards" from the point of view of TopDocs.merge) are noncontinuous or
unordered, searchAfter orders tied documents differently than TopDocs.merge does, and therefore
hits are skipped.
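
The mismatch can be reproduced in a small, Lucene-free model. The sketch below is only an illustration under stated assumptions: all documents have the same score, and the names Hit, MERGE_ORDER, isAfter and pageThrough are hypothetical helpers, not Lucene APIs. MERGE_ORDER mimics the tie-break described above for TopDocs.merge (slice index first), while isAfter mimics the searchAfter filter (document id only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SliceTieBreakDemo {

    /** A hit: global doc id, score, and the index of the slice that produced it. */
    static final class Hit {
        final int doc;
        final float score;
        final int slice;

        Hit(int doc, float score, int slice) {
            this.doc = doc;
            this.score = score;
            this.slice = slice;
        }
    }

    /** Merge order (assumed model): score descending, then slice index, then doc id. */
    static final Comparator<Hit> MERGE_ORDER =
            Comparator.comparingDouble((Hit h) -> -h.score)
                    .thenComparingInt(h -> h.slice)
                    .thenComparingInt(h -> h.doc);

    /** searchAfter filter (assumed model): lower score, or equal score and larger doc id. */
    static boolean isAfter(Hit h, Hit after) {
        return h.score < after.score || (h.score == after.score && h.doc > after.doc);
    }

    /**
     * Pages through an index in which document i belongs to slice sliceOfDoc[i]
     * and every document has the same score. Returns the doc ids in the order
     * in which the pages deliver them.
     */
    static List<Integer> pageThrough(int[] sliceOfDoc, int pageSize) {
        List<Integer> seen = new ArrayList<>();
        Hit after = null;
        while (true) {
            List<Hit> candidates = new ArrayList<>();
            for (int doc = 0; doc < sliceOfDoc.length; doc++) {
                Hit h = new Hit(doc, 1.0f, sliceOfDoc[doc]);
                if (after == null || isAfter(h, after)) {
                    candidates.add(h);
                }
            }
            if (candidates.isEmpty()) {
                break;
            }
            candidates.sort(MERGE_ORDER);
            List<Hit> page = candidates.subList(0, Math.min(pageSize, candidates.size()));
            for (Hit h : page) {
                seen.add(h.doc);
            }
            // The last hit of the page becomes the "after" document of the next page.
            after = page.get(page.size() - 1);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Contiguous slices: docs 0-2 in slice 0, docs 3-5 in slice 1.
        System.out.println(pageThrough(new int[] {0, 0, 0, 1, 1, 1}, 2)); // [0, 1, 2, 3, 4, 5]
        // Interleaved slices: even docs in slice 0, odd docs in slice 1.
        System.out.println(pageThrough(new int[] {0, 1, 0, 1, 0, 1}, 2)); // [0, 2, 4, 3, 4, 5]
    }
}
```

In this toy model the interleaved layout never returns document 1, because the first page ends at document 2 and the next page filters out every document id up to 2. Document 4 is also delivered twice, since the two orderings disagree in both directions.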

On the mailing list, Michael McCandless suggested either improving TopDocs.merge to optionally
use the docID for tie breaking (optionally, because the docID is apparently not global for
every call of TopDocs.merge), or at least documenting the requirement on the return value of
IndexSearcher.slices().

In my use case (generating a fixed number of slices of approximately equal size), the requirement
of ordered slices will result in a less optimal distribution of documents across the slices, but
I am not sure whether this has a real impact on performance.
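
For illustration, here is one way a fixed number of ordered, continuous slices of roughly equal size could be computed. This is a hedged sketch in plain Java, not the code from my application and not a Lucene API: ContiguousSlicer and slice are hypothetical names, leaves are represented only by their document counts in index order, and each returned int[] is a {startInclusive, endExclusive} range of leaf positions.

```java
import java.util.ArrayList;
import java.util.List;

class ContiguousSlicer {

    /**
     * Splits the leaves (given by document count, in index order) into at most
     * numSlices contiguous groups. A group is closed as soon as the running
     * document total reaches the next multiple of total / numSlices, so leaves
     * are never reordered or interleaved.
     */
    static List<int[]> slice(int[] leafSizes, int numSlices) {
        long total = 0;
        for (int size : leafSizes) {
            total += size;
        }
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        long consumed = 0;
        for (int i = 0; i < leafSizes.length; i++) {
            consumed += leafSizes[i];
            // Cumulative document count at which the current slice should end.
            long target = total * (ranges.size() + 1) / numSlices;
            boolean lastLeaf = i == leafSizes.length - 1;
            boolean moreSlicesAllowed = ranges.size() < numSlices - 1;
            if (lastLeaf || (consumed >= target && moreSlicesAllowed)) {
                ranges.add(new int[] {start, i + 1});
                start = i + 1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        int[] leafSizes = {50, 10, 10, 30, 20, 40, 40};
        for (int[] range : slice(leafSizes, 4)) {
            System.out.println("leaves [" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

With the sample sizes above, the four slices hold 50, 50, 60 and 40 documents. An unordered assignment could balance some inputs more evenly, which is the trade-off mentioned above.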



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
