lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8689) Boolean DocValues Codec Implementation
Date Sun, 19 May 2019 16:31:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843482#comment-16843482
] 

Erick Erickson commented on LUCENE-8689:
----------------------------------------

I haven't looked at the patch, and in particular the docValues stuff for boolean fields in
a _long_ time, so this may be totally off base. I wanted to ask how back-compat is handled?
Particularly if there are old-style and new-style segments in the index. There are still lines
like:

 
{code}
@Override
public String toInternal(String val) {
 char ch = (val!=null && val.length()>0) ? val.charAt(0) : 0;
 return (ch=='1' || ch=='t' || ch=='T') ? "T" : "F";
}
{code}
 

in BoolField for instance. 

> Boolean DocValues Codec Implementation
> --------------------------------------
>
>                 Key: LUCENE-8689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8689
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Ivan Mamontov
>            Priority: Minor
>              Labels: patch, performance
>         Attachments: LUCENE-8689.patch, LUCENE-8689.patch, SynteticDocValuesBench70.java,
SynteticDocValuesBench80.java, benchmark_dense.png, boolean_vs_dense_vs_sparse_indexing.png,
boolean_vs_dense_vs_sparse_updates.png, dense_vs_sparse_querying.png, results2.png
>
>
> To avoid issues where some products become available/unavailable at some point in time
after being out-of-stock, e-commerce search system designers need to embed up-to-date information
about inventory availability right into the search engines. Key requirement is to be able
to accurately filter out unavailable products and use availability as one of ranking signals.
However, keeping availability data up-to-date is a non-trivial task. Straightforward implementation
based on a partial updates of Lucene documents causes Solr cache trashing with negatively
affected query performance and resource utilization.
>  As an alternative solution we can use DocValues and build-in in-place updates where
field values can be independently updated without touching inverted index, and while filtering
by DocValues is a bit slower, overall performance gain is better. However existing long based
docValues are not sufficiently optimized for carrying boolean inventory availability data:
>  * All DocValues queries are internally rewritten into org.apache.lucene.search.DocValuesNumbersQuery
which is based on direct iteration over all column values and typically much slower than using
TermsQuery.
>  * On every commit/merge codec has to iterate over DocValues a couple times in order
to choose the best compression algorithm suitable for given data. As a result for 4K fields
and 3M max doc merge takes more than 10 minutes
> This issue is intended to solve these limitations via special bitwise doc values format
that uses internal representation of org.apache.lucene.util.FixedBitSet in order to store
indexed values and load them at search time as a simple long array without additional decoding.
There are several reasons for this:
>  * At index time encoding is super fast without superfluous iterations over all values
to choose the best compression algorithm suitable for given data.
>  * At query time decoding is also simple and fast, no GC pressure and extra steps
>  * Internal representation allows to perform random access in constant time
> Limitations are:
>  * Does not support non boolean fields
>  * Boolean fields must be represented as long values 1 for true and 0 for false
>  * Current implementation does not support advanced bit set formats like org.apache.lucene.util.SparseFixedBitSet
or org.apache.lucene.util.RoaringDocIdSet
> In order to evaluate performance gain I've wrote a simple JMH based benchmark [^SynteticDocValuesBench70.java]
which allows to estimate a relative cost of DF filters. This benchmark creates 2 000 000 documents
with 5 boolean columns with different density, where 10, 35, 50, 60 and 90 is an amount of
documents with value 1. Each method tries to enumerate over all values in synthetic store
field in all available ways:
>  * baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep store
availability. This test just iterates over all bits.
>  * docValuesRaw – iterates over all values of DV column, the same code is used in "post
filtering", sorting and faceting.
>  * docValuesNumbersQuery – iterates over all values produced by query/filter store:1,
actually there is the only query implementation for DV based fields - DocValuesNumbersQuery.
This means that Lucene rewrites all term, range and filter queries for non indexed filed into
this fallback implementation.
>  * docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which support
only two values – 0/1
> !results2.png!
> Query latency is similar to FixedBitSet with negligible overhead 1-2 ms. DocValuesNumbersQuery
6-7 times slower compared to boolean query. Raw doc values iterator is also not so fast as
it performs on-the-fly decoding.
> Attached patch contains two parts:
>  * bitwise codec and all required structures and producers/consumers
>  * boolean query which removes TwoPhaseIterator, AllBits approximation and missing docs
lookup
>  * docValues codec test green except non long values cases



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message