lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bruno Roustant (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit
Date Tue, 09 Apr 2019 08:14:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813106#comment-16813106
] 

Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 8:13 AM:
----------------------------------------------------------------

It took me some time to run wikimedimall 8 GB index (didn't anticipate 1h indexing initially
- a little less for UniformSplit, then I had an exception about facets).

Then I got results which surprised me. BlockTree and UniformSplit had the same QPS for Term
and Phrase queries. I didn't understand why a different behavior between a small and a large
index.

Then I thought about 2 explanations:
 * Much larger index could mean less OS IO cache hits. I ran the benchmark with a 16 GB laptop
and a 64 GB desktop. Actually I got nearly no difference in my test.
 * Much larger index could mean more results. So the time spent to score and rank the results
could become much larger and diminish the effect of a change in the dictionary. I have no
clue there at the moment.

Here is the result of wikimedimall on a 64 GB desktop:

(I used -Jira option, but it does not seem to recognize the \{color} tag)
||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||
|Fuzzy1|72.81|3.11|21.77|0.71|{color:red}72%\{color}-\{color:red}67%\{color}|
|Fuzzy2|66.77|3.77|20.41|0.67|{color:red}72%\{color}-\{color:red}66%\{color}|
|Respell|8.85|0.64|6.02|0.33|{color:red}40%\{color}-\{color:red}22%\{color}|
|PKLookup|130.83|3.96|121.66|12.37|{color:red}18%\{color}-\{color:green}5%\{color}|
|Wildcard|25.03|1.33|23.93|1.19|{color:red}13%\{color}-\{color:green}6%\{color}|
|HighTermMonthSort|19.03|2.55|18.40|1.56|{color:red}21%\{color}-\{color:green}21%\{color}|
|Prefix3|12.47|0.82|12.10|0.78|{color:red}14%\{color}-\{color:green}10%\{color}|
|LowTerm|182.95|14.94|177.97|18.67|{color:red}19%\{color}-\{color:green}17%\{color}|
|IntNRQ|5.21|0.54|5.09|0.56|{color:red}21%\{color}-\{color:green}21%\{color}|
|MedTerm|90.74|3.99|89.14|4.24|{color:red}10%\{color}-\{color:green}7%\{color}|
|HighTerm|42.54|1.95|41.86|2.00|{color:red}10%\{color}-\{color:green}8%\{color}|
|OrNotHighLow|532.96|16.16|526.86|24.40|{color:red}8%\{color}-\{color:green}6%\{color}|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|{color:red}7%\{color}-\{color:green}6%\{color}|
|OrNotHighMed|53.64|1.08|53.37|1.22|{color:red}4%\{color}-\{color:green}3%\{color}|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|{color:red}4%\{color}-\{color:green}3%\{color}|
|HighPhrase|32.24|0.85|32.09|0.81|{color:red}5%\{color}-\{color:green}4%\{color}|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|{color:red}3%\{color}-\{color:green}3%\{color}|
|AndHighHigh|26.97|0.31|26.88|0.37|{color:red}2%\{color}-\{color:green}2%\{color}|
|MedPhrase|4.95|0.16|4.94|0.15|{color:red}6%\{color}-\{color:green}6%\{color}|
|AndHighMed|50.03|0.72|49.97|0.72|{color:red}2%\{color}-\{color:green}2%\{color}|
|OrNotHighHigh|18.85|0.76|18.85|0.82|{color:red}8%\{color}-\{color:green}8%\{color}|
|OrHighNotHigh|9.35|0.32|9.35|0.35|{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighLow|15.85|0.59|15.85|0.52|{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighNotLow|17.56|0.71|17.57|0.70|{color:red}7%\{color}-\{color:green}8%\{color}|
|AndHighLow|284.39|4.41|284.60|5.65|{color:red}3%\{color}-\{color:green}3%\{color}|
|LowPhrase|224.73|4.35|224.97|4.84|{color:red}3%\{color}-\{color:green}4%\{color}|
|OrHighNotMed|13.21|0.49|13.22|0.50|{color:red}7%\{color}-\{color:green}7%\{color}|
|OrHighMed|13.22|0.73|13.30|0.70|{color:red}9%\{color}-\{color:green}12%\{color}|
|OrHighHigh|7.56|0.43|7.62|0.41|{color:red}9%\{color}-\{color:green}12%\{color}|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|{color:red}36%\{color}-\{color:green}63%\{color}|
|LowSpanNear|11.84|0.19|11.99|0.21|{color:red}2%\{color}-\{color:green}4%\{color}|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|{color:red}15%\{color}-\{color:green}20%\{color}|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|{color:red}36%\{color}-\{color:green}64%\{color}|
|MedSpanNear|10.50|0.18|10.67|0.21|{color:red}2%\{color}-\{color:green}5%\{color}|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|{color:red}35%\{color}-\{color:green}62%\{color}|
|HighSpanNear|8.68|0.19|8.88|0.19|{color:red}2%\{color}-\{color:green}6%\{color}|


was (Author: bruno.roustant):
It took me some time to run wikimedimall 8 GB index (didn't anticipate 1h indexing initially
- a little less for UniformSplit, then I had an exception about facets).

Then I got results which surprised me. BlockTree and UniformSplit had the same QPS for Term
and Phrase queries. I didn't understand why a different behavior between a small and a large
index.

Then I thought about 2 explanations:
 * Much larger index could mean less OS IO cache hits. I ran the benchmark with a 16 GB laptop
and a 64 GB desktop. Actually I got nearly no difference in my test.
 * Much larger index could mean more results. So the time spent to score and rank the results
could become much larger and diminish the effect of a change in the dictionary. I have no
clue there at the moment.

Here is the result of wikimedimall on a 64 GB desktop:

||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||||
|Fuzzy1|72.81|3.11|21.77|0.71|\{color:red}72%\{color}-\{color:red}67%\{color}|
|Fuzzy2|66.77|3.77|20.41|0.67|\{color:red}72%\{color}-\{color:red}66%\{color}|
|Respell|8.85|0.64|6.02|0.33|\{color:red}40%\{color}-\{color:red}22%\{color}|
|PKLookup|130.83|3.96|121.66|12.37|\{color:red}18%\{color}-\{color:green}5%\{color}|
|Wildcard|25.03|1.33|23.93|1.19|\{color:red}13%\{color}-\{color:green}6%\{color}|
|HighTermMonthSort|19.03|2.55|18.40|1.56|\{color:red}21%\{color}-\{color:green}21%\{color}|
|Prefix3|12.47|0.82|12.10|0.78|\{color:red}14%\{color}-\{color:green}10%\{color}|
|LowTerm|182.95|14.94|177.97|18.67|\{color:red}19%\{color}-\{color:green}17%\{color}|
|IntNRQ|5.21|0.54|5.09|0.56|\{color:red}21%\{color}-\{color:green}21%\{color}|
|MedTerm|90.74|3.99|89.14|4.24|\{color:red}10%\{color}-\{color:green}7%\{color}|
|HighTerm|42.54|1.95|41.86|2.00|\{color:red}10%\{color}-\{color:green}8%\{color}|
|OrNotHighLow|532.96|16.16|526.86|24.40|\{color:red}8%\{color}-\{color:green}6%\{color}|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|\{color:red}7%\{color}-\{color:green}6%\{color}|
|OrNotHighMed|53.64|1.08|53.37|1.22|\{color:red}4%\{color}-\{color:green}3%\{color}|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|\{color:red}4%\{color}-\{color:green}3%\{color}|
|HighPhrase|32.24|0.85|32.09|0.81|\{color:red}5%\{color}-\{color:green}4%\{color}|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|\{color:red}3%\{color}-\{color:green}3%\{color}|
|AndHighHigh|26.97|0.31|26.88|0.37|\{color:red}2%\{color}-\{color:green}2%\{color}|
|MedPhrase|4.95|0.16|4.94|0.15|\{color:red}6%\{color}-\{color:green}6%\{color}|
|AndHighMed|50.03|0.72|49.97|0.72|\{color:red}2%\{color}-\{color:green}2%\{color}|
|OrNotHighHigh|18.85|0.76|18.85|0.82|\{color:red}8%\{color}-\{color:green}8%\{color}|
|OrHighNotHigh|9.35|0.32|9.35|0.35|\{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighLow|15.85|0.59|15.85|0.52|\{color:red}6%\{color}-\{color:green}7%\{color}|
|OrHighNotLow|17.56|0.71|17.57|0.70|\{color:red}7%\{color}-\{color:green}8%\{color}|
|AndHighLow|284.39|4.41|284.60|5.65|\{color:red}3%\{color}-\{color:green}3%\{color}|
|LowPhrase|224.73|4.35|224.97|4.84|\{color:red}3%\{color}-\{color:green}4%\{color}|
|OrHighNotMed|13.21|0.49|13.22|0.50|\{color:red}7%\{color}-\{color:green}7%\{color}|
|OrHighMed|13.22|0.73|13.30|0.70|\{color:red}9%\{color}-\{color:green}12%\{color}|
|OrHighHigh|7.56|0.43|7.62|0.41|\{color:red}9%\{color}-\{color:green}12%\{color}|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|\{color:red}36%\{color}-\{color:green}63%\{color}|
|LowSpanNear|11.84|0.19|11.99|0.21|\{color:red}2%\{color}-\{color:green}4%\{color}|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|\{color:red}15%\{color}-\{color:green}20%\{color}|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|\{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|\{color:red}37%\{color}-\{color:green}64%\{color}|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|\{color:red}36%\{color}-\{color:green}64%\{color}|
|MedSpanNear|10.50|0.18|10.67|0.21|\{color:red}2%\{color}-\{color:green}5%\{color}|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|\{color:red}35%\{color}-\{color:green}62%\{color}|
|HighSpanNear|8.68|0.19|8.88|0.19|\{color:red}2%\{color}-\{color:green}6%\{color}|

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the pdf attached explains visually the technique in more details)
>  The principle is to split the list of terms into blocks and use a FST to access the
block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the
blocks, there is a target average block size (number of terms), with an allowed delta variation
(10%) to compare the terms and select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more compact and speed
up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit
with BlockTree. Find it in the first comment and also attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced
by 20%. So this PostingsFormat scales to lots of docs, as BlockTree.
> This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting”
to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we have already
exercised this PostingsFormat extensibility to create a different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message