lucene-dev mailing list archives

From Varun Thacker <va...@vthacker.in>
Subject Re: Do we leverage index sort for filters?
Date Thu, 05 Mar 2020 21:28:14 GMT
Thanks, Adrien, for the background.

IndexSortSortedNumericDocValuesRangeQuery is a neat idea! I imagine the
logs use-case where every search has a filter makes this optimization
important.

In https://github.com/apache/lucene-solr/pull/715 the benchmark indexed
123M docs. The result for *range with single point [897303051,
897303051], 124 docs* showed a slight slowdown compared to what we had
originally. However, the number of matching documents was very small
compared to the total number of docs.

I created another dataset locally where I indexed 5M docs with 10 unique
values for the filtering field.

*Query 1:*
Query longPointFq = LongPoint.newExactQuery("category", 1);


*Query 2:*
Query fallbackQuery =
    SortedNumericDocValuesField.newSlowRangeQuery("category_dv", 1, 1);
IndexSortSortedNumericDocValuesRangeQuery optimizedFq =
    new IndexSortSortedNumericDocValuesRangeQuery("category_dv", 1, 1, fallbackQuery);

I ran each query 1000 times and recorded the total time:
Query 1 took 3300ms
Query 2 took 150ms

The numbers were pretty consistent on running it a couple of times.
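
For intuition on why Query 2 can be this much faster: per Adrien's
description quoted below, the query binary searches for the doc IDs at the
start and end of the interval. Here is a minimal sketch of that idea in plain
Java, not Lucene's actual implementation. It assumes a single segment sorted
ascending on the field, with exactly one value per doc, modeled as a long[]
indexed by doc ID. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: models a segment whose docs are sorted ascending by a
// single-valued numeric field, so valuesByDocId[docId] is non-decreasing.
class SortedRangeSketch {

    // First index i in [0, values.length] such that values[i] >= target.
    static int lowerBound(long[] values, long target) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns {firstDoc, endDocExclusive} for docs whose value lies in
    // [lower, upper]. An empty match yields firstDoc == endDocExclusive.
    static int[] matchingDocRange(long[] valuesByDocId, long lower, long upper) {
        int first = lowerBound(valuesByDocId, lower);
        // Guard against overflow of upper + 1 when upper is Long.MAX_VALUE.
        int end = (upper == Long.MAX_VALUE)
            ? valuesByDocId.length
            : lowerBound(valuesByDocId, upper + 1);
        return new int[] {first, Math.max(first, end)};
    }

    public static void main(String[] args) {
        long[] category = {0, 0, 1, 1, 1, 2, 5, 5};
        // Exact match on category == 1, expressed as the range [1, 1].
        int[] range = matchingDocRange(category, 1, 1);
        System.out.println(Arrays.toString(range)); // prints [2, 5]
    }
}
```

Under these assumptions the matching docs form one contiguous doc ID range
found with two O(log n) searches, instead of visiting every matching doc.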

Curious to hear your thoughts on trying to use this optimization for exact
queries as well.


On Thu, Mar 5, 2020 at 7:59 AM Adrien Grand <jpountz@gmail.com> wrote:

> We don't directly take advantage of index sort in this case, but index
> sorting still makes this faster. I had mentioned it in a presentation a
> couple years ago
> https://speakerdeck.com/elastic/get-the-lay-of-the-lucene-land-1?slide=14:
> querying geonames for TYPE:CITY AND CONTRY_CODE_US ran 1.6x faster when the
> index is sorted by TYPE then CONTRY_CODE.
>
> There are two contributing factors to it. The first one is that postings
> are cheaper to decode, because they consist of long ranges of doc IDs that
> increment by 1. The second is that having filters that match dense ranges of
> doc IDs is a better case for ConjunctionDISI than combining iterators whose
> doc IDs are interleaved.
>
> We have a single query that takes advantage of index sorting explicitly to
> my knowledge: IndexSortSortedNumericDocValuesRangeQuery. This query runs
> range queries on numbers using doc values by binary searching the doc IDs
> that map to the start and the end of the interval.
>
> On Thu, Mar 5, 2020 at 12:56 AM Varun Thacker <varun@vthacker.in> wrote:
>
>> If I have an index sorted by category and at search time filter on one
>> category
>>
>> Do we currently take advantage of index sort for this sort of a filter
>> query?
>>
>>
>
> --
> Adrien
>
