lucene-java-user mailing list archives

From Ayse Onalan <>
Subject early termination with query time sorting (but without index-time SortingMergePolicy)
Date Fri, 19 Apr 2019 20:29:27 GMT
Hi Lucene users,
I'm looking into ways to improve filtered and sorted searches over our untokenized fields.
For instance, filter by fields A and B (term, prefix, or range) and sort by the value of A.

Lucene scores and collects documents in DocId order. DocIds are normally assigned incrementally
as documents are added to the index, so they are effectively random with respect to the doc
value ordinals of any particular field. Hence when a query matches more documents in a segment
than the page size, Lucene still has to score and collect *all* matching documents and sort
them at the end to decide which documents constitute the top page. When the number of matching
documents is far greater than the page size, this hurts performance.
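To make that cost concrete, here is a minimal, Lucene-free sketch (all names hypothetical): because hits arrive in DocId order rather than sort order, even a page size of 2 forces the collector to visit every match and maintain a priority queue of the current top hits.

```java
import java.util.*;

public class TopKInDocIdOrder {
    /** Collect the top-k docIds by ascending sort value; every match must be visited. */
    public static List<Integer> topK(int[] sortValueByDocId, int k) {
        // Max-heap on sort value, so the worst of the current top-k can be evicted.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> b[1] - a[1]);
        for (int docId = 0; docId < sortValueByDocId.length; docId++) { // DocId order
            heap.offer(new int[] {docId, sortValueByDocId[docId]});
            if (heap.size() > k) heap.poll(); // evict the current worst hit
        }
        List<int[]> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingInt(a -> a[1])); // final sort of the page
        List<Integer> page = new ArrayList<>();
        for (int[] h : hits) page.add(h[0]);
        return page;
    }
}
```

The loop cannot break early: the best sort value may belong to the very last DocId.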

When we use SortingMergePolicy, merged segments have DocIds in the same order as the sort
field(s) of the merge policy. If a query's sort spec is consistent with the segment's sort
order, documents are scored and collected in sort-value order, so collection can be terminated
early once the page size is reached. This yields significant performance gains. However,
index-time sorting is limited to a single sort spec, and we need the ability to sort efficiently
using more than one.
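For contrast, a sketch of the early-terminating case (again Lucene-free and hypothetical): when DocId order coincides with sort order, collection stops as soon as the page is full and the remaining documents are never scored.

```java
import java.util.*;

public class EarlyTerminatingCollect {
    /** When DocIds are assigned in sort order, stop after pageSize hits. */
    public static List<Integer> firstPage(boolean[] matches, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int docId = 0; docId < matches.length && page.size() < pageSize; docId++) {
            if (matches[docId]) page.add(docId); // DocId order == sort order here
        }
        return page; // documents after the last collected DocId are never scored
    }
}
```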

I want to understand what it would take to allow early termination with query time sorting
but without index-time sorting. Or what issues would prevent us from doing so.
At a high level, it appears theoretically possible to me with the following set of changes.
I want to check with the community for suggestions, whether I am overlooking/missing some
problems this would face, or whether this seems viable:

  1.  We would need to be able to get the DocIds for sorted doc values.

Lucene already exposes a FieldCache that is enumerable by sorted DocValues. It exposes the
DocId->DocValue and DocId->ordinal mappings, but not the other way around.

We'd need to change FieldCache to also keep a reverse map of ordinal->DocId and expose it
via a new API. That makes it possible to get the DocId associated with the current
DocValue/ordinal during enumeration.
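Building that reverse map is a single linear pass. A standalone sketch (the input array stands in for what FieldCache exposes; FieldCache itself is not used here):

```java
public class OrdinalToDocId {
    /** Invert a DocId->ordinal map into an ordinal->DocIds map. */
    public static int[][] invert(int[] ordByDocId, int ordCount) {
        int[] counts = new int[ordCount];
        for (int ord : ordByDocId) counts[ord]++;          // pass 1: size each bucket
        int[][] docsByOrd = new int[ordCount][];
        for (int ord = 0; ord < ordCount; ord++) docsByOrd[ord] = new int[counts[ord]];
        int[] fill = new int[ordCount];
        for (int docId = 0; docId < ordByDocId.length; docId++) { // pass 2: fill buckets
            int ord = ordByDocId[docId];
            docsByOrd[ord][fill[ord]++] = docId;           // DocIds stay ascending per ordinal
        }
        return docsByOrd;
    }
}
```

Enumerating ordinals 0..ordCount-1 and walking each bucket then yields DocIds in sort-value order.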

  2.  We would have to change query scorers to be aware of the desired DocId order (which would
come from the field cache of the sort property) so they can score and collect in that order.

Currently all DocIdSetIterators only allow advancing forward in DocId order. We would have
to change the following:

  3.  TermScorer uses the postings reader's BlockDocsEnum as its DocIdSetIterator.

DocIds for each term are kept in blocks, each block keeping DocIds in order. The postings
also maintain multi-level skip lists so that a target DocId can be reached efficiently, but
these skip lists maintain only forward pointers.

So we would have to write a codec (a postings writer) that also keeps multi-level skip lists
with *backward* pointers.

And we would have to write a postings reader whose BlockDocsEnum allows seeking DocIds both
backward and forward (using the backward multi-level skip lists).
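The enum's contract might look like the sketch below (hypothetical names; a sorted array with binary search stands in for the real block structure and skip lists, just to pin down the forward/backward seek semantics):

```java
import java.util.Arrays;

public class BidirectionalDocsEnum {
    private final int[] docs; // sorted DocIds for one term (stand-in for postings blocks)

    public BidirectionalDocsEnum(int[] docs) { this.docs = docs; }

    /** Smallest docId >= target (the forward seek skip lists support today). */
    public int advance(int target) {
        int i = Arrays.binarySearch(docs, target);
        int pos = i >= 0 ? i : -i - 1;
        return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; // exhausted forward
    }

    /** Largest docId <= target (the backward seek the proposal needs). */
    public int retreat(int target) {
        int i = Arrays.binarySearch(docs, target);
        int pos = i >= 0 ? i : -i - 2;
        return pos >= 0 ? docs[pos] : -1; // exhausted backward
    }
}
```

A backward skip list would make retreat() sublinear over real on-disk blocks, mirroring what forward skip lists do for advance().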

  4.  For scorers that directly use field cache filters, the underlying DocIdSetIterator is
a FixedBitSet. It allows only moving forward using words and bit operators; we would need
to change that to go in either direction. This would still be efficient, given that FixedBitSet
is an in-memory bitmap structure.
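Backward iteration over such a bitmap is cheap with the same word-and-bit operations, just mirrored. A sketch over a raw long[] (FixedBitSet-style packing assumed; not the actual FixedBitSet API):

```java
public class BitSetRetreat {
    /** Largest set bit <= target in a long[] bitmap (64 bits per word). */
    public static int prevSetBit(long[] words, int target) {
        int wordIdx = target >> 6;
        // Mask off bits above target within its word, then scan words downward.
        long word = words[wordIdx] & (-1L >>> (63 - (target & 63)));
        while (true) {
            if (word != 0) {
                return (wordIdx << 6) + 63 - Long.numberOfLeadingZeros(word);
            }
            if (--wordIdx < 0) return -1; // no set bit at or below target
            word = words[wordIdx];
        }
    }
}
```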
  5.  Once the building blocks above are in place, we would need to change queries and scorers
to remove any assumption that DocIds strictly move forward.

For instance, FilteredQuery uses a leap-frog strategy that strictly assumes DocIds only move
forward.
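The forward-only assumption in leap-frogging is visible in a minimal sketch (plain sorted arrays standing in for the filter and query iterators): both cursors only ever increase, which is exactly what bidirectional iteration would have to relax.

```java
import java.util.*;

public class LeapFrog {
    /** Classic leap-frog intersection: each side only ever advances forward. */
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; } // match: both advance
            else if (a[i] < b[j]) i++;                     // leap a toward b
            else j++;                                      // leap b toward a
        }
        return out;
    }
}
```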

ConjunctionScorer makes the same assumption in the way it evaluates and advances each individual
sub-scorer.

I am relatively new to the Lucene world and would appreciate thoughts and feedback on this.
Thank you,
