lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Smiley (JIRA)" <>
Subject [jira] [Updated] (LUCENE-8374) Reduce reads for sparse DocValues
Date Mon, 22 Oct 2018 13:01:01 GMT


David Smiley updated LUCENE-8374:
    Fix Version/s:     (was: 7.5)
                       (was: 7.3.1)
                       (was: trunk)

> Reduce reads for sparse DocValues
> ---------------------------------
>                 Key: LUCENE-8374
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 7.5, master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Major
>              Labels: performance
>         Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch,
LUCENE-8374.patch, LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, LUCENE-8374_branch_7_4.patch,
> The {{Lucene70DocValuesProducer}} has the internal classes {{SparseNumericDocValues}}
and {{BaseSortedSetDocValues}} (sparse code path), which again uses {{IndexedDISI}} to handle
the docID -> value-ordinal lookup. The value-ordinal is the index of the docID assuming
an abstract tightly packed monotonically increasing list of docIDs: If the docIDs with corresponding
values are {{[0, 4, 1432]}}, their value-ordinals will be {{[0, 1, 2]}}.
> h2. Outer blocks
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values (65536), where
each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16 values) or {{SPARSE}} (< 2^12
values ~= 6%). Consequently blocks vary quite a lot in size and ordinal resolving strategy.
> When a sparse Numeric DocValue is needed, the code first locates the block containing
the wanted docID flag. It does so by iterating blocks one-by-one until it reaches the needed
one, where each iteration requires a lookup in the underlying {{IndexSlice}}. For a common
memory mapped index, this translates to either a cached request or a read operation. If a
segment has 6M documents, worst-case is 91 lookups. In our web archive, our segments has ~300M
values: A worst-case of 4577 lookups!
> One obvious solution is to use a lookup-table for blocks: A long[]-array with an entry
for each block. For 6M documents, that is < 1KB and would allow for direct jumping (a single
lookup) in all instances. Unfortunately this lookup-table cannot be generated upfront when
the writing of values is purely streaming. It can be appended to the end of the stream before
it is closed, but without knowing the position of the lookup-table the reader cannot seek
to it.
> One strategy for creating such a lookup-table would be to generate it during reads and
cache it for next lookup. This does not fit directly into how {{IndexedDISI}} currently works
(it is created anew for each invocation), but could probably be added with a little work.
An advantage to this is that this does not change the underlying format and thus could be
used with existing indexes.
> h2. The lookup structure inside each block
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the ordinal is
simply the requested docID with some modulo and multiply math. Nothing to improve there.
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is used and the
number of set bits up to the wanted index (the docID modulo the block origo) are counted.
That bitmap is a long[1024], meaning that worst case is to lookup and count all set bits for
1024 longs!
> One known solution to this is to use a [rank structure|[]].
I [implemented it|[]]
for a related project and with that (), the rank-overhead for a {{DENSE}} block would be long[32]
and would ensure a maximum of 9 lookups. It is not trivial to build the rank-structure and
caching it (assuming all blocks are dense) for 6M documents would require 22 KB (3.17% overhead).
It would be far better to generate the rank-structure at index time and store it immediately
before the bitset (this is possible with streaming as each block is fully resolved before
flushing), but of course that would require a change to the codec.
> If {{SPARSE}} (< 2^12 values ~= 6%) are defined, the docIDs are simply in the form
of a list. As a comment in the code suggests, a binary search through these would be faster,
although that would mean seeking backwards. If that is not acceptable, I don't have any immediate
idea for avoiding the full iteration.
> I propose implementing query-time caching of both block-jumps and inner-block lookups
for {{DENSE}} (using rank) as first improvement and an index-time {{DENSE}}-rank structure
for future improvement. As query-time caching is likely to be too costly for rapidly-changing
indexes, it should probably be an opt-in in solrconfig.xml.
> h2. Some real-world observations
> This analysis was triggered by massive (10x) slowdown problems with both simple querying
and large exports from our webarchive index after upgrading from Solr 4.10 to 7.3.1. The query-matching
itself takes ½-2 seconds, but returning the top-10 documents takes 5-20 seconds (~50 non-stored
DocValues fields), up from ½-2 seconds in total from Solr 4.10 (more of a mix of stored vs.
DocValues, so might not be directly comparable).
> Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}} as *the* hotspot. 
We ran some tests with simple queries on a single 307,171,504 document segment with different
single-value DocValued fields in the fl and got
> ||Field||Type||Docs with value||Docs w/ val %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
>  Note how both url and source_file_offset are very fast and also has a value for _all_
documents. Contrary to this, content_type_ext is very slow and crawl_date is extremely slow
and as they both have _nearly_ all documents, I presume they are using {{IndexedDISI#DENSE}}.
last_modified is also quite slow and presumably uses {{IndexedDISI#SPARSE}}.
> The only mystery is crawl_year which is also present in _nearly_ all documents, but is
very fast. I have no explanation for that one yet.
> I hope to take a stab at this around August 2018, but no promises.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message