lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging
Date Sun, 01 May 2016 12:57:13 GMT


Robert Muir commented on LUCENE-7253:

If postings list compression is used for example, then see here:

This is really easy to understand: it optimizes for runs of the same value. If that value
is 500 or zero, it does not care. This was added to the postings list code because it helps
real use cases.

If similar compression is used, sparse cases will work better without being explicitly optimized
for. This is what I am asking for: that we optimize for *real* use cases only.
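
To illustrate the run-based idea in the comment above, here is a minimal Java sketch of run-length encoding. It is not Lucene's postings or doc values code; the class `RunLengthSketch` and its `Run` type are hypothetical names introduced only for this example. The point it shows is that the encoder never asks which value repeats, so a mostly-missing field and a constant field both shrink in the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not Lucene's postings or doc values format. */
final class RunLengthSketch {

  /** One run: {@code length} consecutive documents that share {@code value}. */
  static final class Run {
    final long value;
    final int length;

    Run(long value, int length) {
      this.value = value;
      this.length = length;
    }
  }

  /** Collapses consecutive equal values into runs, whatever the value is. */
  static List<Run> encode(long[] values) {
    List<Run> runs = new ArrayList<>();
    int i = 0;
    while (i < values.length) {
      int start = i;
      long v = values[i];
      // Advance to the end of the current run of equal values.
      while (i < values.length && values[i] == v) {
        i++;
      }
      runs.add(new Run(v, i - start));
    }
    return runs;
  }

  /** Expands runs back into one value per document. */
  static long[] decode(List<Run> runs) {
    int total = 0;
    for (Run r : runs) {
      total += r.length;
    }
    long[] out = new long[total];
    int pos = 0;
    for (Run r : runs) {
      Arrays.fill(out, pos, pos + r.length, r.value);
      pos += r.length;
    }
    return out;
  }
}
```

With this sketch, the sparse input `{0, 0, 0, 0, 7, 0, 0, 0}` encodes to three runs and the constant input of eight `500`s encodes to one run; neither case is special-cased.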

> Sparse data in doc values and segments merging 
> -----------------------------------------------
>                 Key: LUCENE-7253
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 5.5, 6.0
>            Reporter: Pawel Rog
>              Labels: performance
> Doc Values were optimized recently to efficiently store sparse data. Unfortunately there
is still a big problem with Doc Values merges for sparse fields. For an index with 1 billion
documents, it does not seem to matter whether every document has a value for this field or only
1 document does: segment merge time is the same in both cases. In most cases this is
not a problem, but there are several cases in which one can expect to have many fields with
sparse doc values.
> I can describe an example. During performance tests of a system with a large number of
sparse fields I realized that Doc Values merges were a bottleneck. I had hundreds of different
numeric fields. Each document contained only a small subset of all fields. The average document
contained 5-7 different numeric values. As you can see, the data was very sparse in these fields.
It turned out that the ingestion process was CPU-bound. Most of the CPU time was spent in DocValues
related methods (SingletonSortedNumericDocValues#setDocument, DocValuesConsumer$10$1#next,
DocValuesConsumer#isSingleValued, DocValuesConsumer$4$1#setNext, ...), mostly during merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing them with a smaller
number of denser fields. This helped a lot but complicated field naming.
> I am not very familiar with the Doc Values source code, but I have a small suggestion for how to
improve Doc Values merges for sparse fields. I noticed that Doc Values producers and consumers
use Iterators. Take numeric Doc Values as an example. Would it be possible to replace
the Iterator that "travels" through all documents with an Iterator over the collection of non-empty
values? Of course, this would require storing an object (instead of a number) that contains the value
and the document ID. Such an iterator could significantly improve merge time for sparse Doc Values
fields. IMHO this won't cause much overhead for dense structures, but it could be a game changer
for sparse structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
>     dvConsumer.addNumericField(fieldInfo,
>                                new Iterable<Number>() {
>                                  @Override
>                                  public Iterator<Number> iterator() {
>                                    return new NumericIterator(maxDoc, values, docsWithField);
>                                  }
>                                });
> {code}
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
>     // Fill in any holes:
>     for (int i = (int)pending.size(); i < docID; ++i) {
>       pending.add(MISSING);
>     }
> {code}
> It turns out that the variable called pending is used only internally in NumericDocValuesWriter.
I know pending is PackedLongValues and it wouldn't be good to replace it with a different class
(some kind of list) because this may break DV performance for dense fields. I hope someone
can suggest interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance could start here.
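
The difference between the two iteration styles discussed in the quoted issue can be sketched roughly as follows. This is a self-contained Java illustration under stated assumptions, not Lucene code: `SparseIterationSketch`, `DocValue`, `denseIterator`, `sparseIterator` and `countSteps` are hypothetical names, and the values are assumed to be held as two parallel arrays sorted by ascending document ID. The dense iterator takes one step per document up to `maxDoc` (returning `null` for holes), while the sparse iterator takes one step per document that actually has a value.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch; these names are illustrative and are not Lucene APIs. */
final class SparseIterationSketch {

  /** A (docID, value) pair for a document that actually has a value. */
  static final class DocValue {
    final int docID;
    final long value;

    DocValue(int docID, long value) {
      this.docID = docID;
      this.value = value;
    }
  }

  /** Dense style: one step per document in [0, maxDoc), null where the value is missing. */
  static Iterator<Long> denseIterator(int maxDoc, int[] docIDs, long[] values) {
    return new Iterator<Long>() {
      private int doc = 0;   // next document to visit
      private int upto = 0;  // next entry in docIDs/values

      @Override
      public boolean hasNext() {
        return doc < maxDoc;
      }

      @Override
      public Long next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Long result = null;
        if (upto < docIDs.length && docIDs[upto] == doc) {
          result = values[upto++];
        }
        doc++;
        return result;
      }
    };
  }

  /** Sparse style: one step per document that has a value; holes are never visited. */
  static Iterator<DocValue> sparseIterator(int[] docIDs, long[] values) {
    return new Iterator<DocValue>() {
      private int upto = 0;

      @Override
      public boolean hasNext() {
        return upto < docIDs.length;
      }

      @Override
      public DocValue next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        DocValue dv = new DocValue(docIDs[upto], values[upto]);
        upto++;
        return dv;
      }
    };
  }

  /** Counts how many times next() must be called to exhaust the iterator. */
  static int countSteps(Iterator<?> it) {
    int steps = 0;
    while (it.hasNext()) {
      it.next();
      steps++;
    }
    return steps;
  }
}
```

For a segment with `maxDoc = 1000` where only documents 3 and 500 have values, `countSteps` reports 1000 for the dense iterator and 2 for the sparse one. The trade-off raised in the issue still applies: the sparse form carries a document ID alongside each value, which is extra overhead when nearly every document has a value.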
