lucene-dev mailing list archives

From "juan camilo rodriguez duran (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
Date Wed, 24 Apr 2019 14:36:00 GMT


juan camilo rodriguez duran commented on LUCENE-8753:

[~rcmuir] as [~jpountz] said, the last benchmark does not show the benefits of Uniform Split
because most of the query time is spent processing the postings. As a recap, Uniform Split
shines for its simplicity and extensibility, and it also offers lower memory consumption
and faster segment merge.

> New PostingFormat - UniformSplit
> --------------------------------
>                 Key: LUCENE-8753
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>          Time Spent: 10m
>  Remaining Estimate: 0h
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (The attached PDF explains the technique visually in more detail.)
>  The principle is to split the list of terms into blocks and use an FST to access the
> block, not as a prefix trie but with a seek-floor pattern. To select the blocks, there is
> a target average block size (number of terms) with an allowed delta variation (10%): the
> terms within that range are compared and the one with the minimal distinguishing prefix
> is selected.
>  There are also several optimizations inside the block to make it more compact and speed
> up the loading/scanning.
> The performance measured with the luceneutil benchmark, comparing UniformSplit with
> BlockTree, is interesting. The results are in the first comment and are also attached for
> better formatting.
> Although the precise percentages vary between runs, three main points stand out:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced
> by 20%. So this PostingsFormat scales to lots of docs, as BlockTree does.
> This initial version passes all Lucene tests. Use "ant test -Dtests.codec=UniformSplitTesting"
> to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. We have also already
> exercised this PostingsFormat's extensibility to create a different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
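
To illustrate the block selection and seek-floor lookup described in the quoted issue, here
is a minimal Java sketch. It is not taken from the patch. The class and method names
(UniformSplitSketch, mdpLength, buildBlocks, contains) are hypothetical, a TreeMap stands in
for the FST, Strings stand in for term bytes, and the input is assumed to be a sorted list
of distinct terms. The real format also stores postings metadata and applies in-block
optimizations that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names come from the Lucene patch. */
final class UniformSplitSketch {

  /** Length of the shortest prefix of term that sorts strictly after previous. */
  static int mdpLength(String previous, String term) {
    int common = 0;
    int max = Math.min(previous.length(), term.length());
    while (common < max && previous.charAt(common) == term.charAt(common)) {
      common++;
    }
    // Valid because term > previous in a sorted list of distinct terms.
    return common + 1;
  }

  /** Splits sorted, distinct terms into blocks keyed by a minimal distinguishing prefix. */
  static TreeMap<String, List<String>> buildBlocks(
      List<String> sortedTerms, int targetSize, int delta) {
    TreeMap<String, List<String>> blocks = new TreeMap<>(); // stand-in for the FST
    int start = 0;
    String key = ""; // the first block has no predecessor, so the empty key covers it
    while (start < sortedTerms.size()) {
      int end = sortedTerms.size(); // default: the block takes all remaining terms
      int lo = Math.max(start + targetSize - delta, start + 1);
      int hi = Math.min(start + targetSize + delta, sortedTerms.size() - 1);
      String nextKey = null;
      int best = Integer.MAX_VALUE;
      // Among candidates within the allowed delta, keep the shortest distinguishing prefix.
      for (int i = lo; i <= hi; i++) {
        int len = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        if (len < best) {
          best = len;
          end = i;
          nextKey = sortedTerms.get(i).substring(0, len);
        }
      }
      blocks.put(key, new ArrayList<>(sortedTerms.subList(start, end)));
      start = end;
      key = nextKey;
    }
    return blocks;
  }

  /** Seek-floor lookup: find the block with the greatest key <= term, then scan it. */
  static boolean contains(TreeMap<String, List<String>> blocks, String term) {
    Map.Entry<String, List<String>> entry = blocks.floorEntry(term);
    return entry != null && entry.getValue().contains(term);
  }

  public static void main(String[] args) {
    List<String> terms =
        List.of("apple", "apply", "apt", "band", "bank", "bar", "cat", "cow");
    TreeMap<String, List<String>> blocks = buildBlocks(terms, 3, 1);
    System.out.println(blocks.keySet()); // prints [, b, c]
    System.out.println(contains(blocks, "bank")); // prints true
    System.out.println(contains(blocks, "bat")); // prints false
  }
}
```

In this toy run the block keys are far shorter than the terms themselves ("b" instead of
"band"), which gives an idea of why the dictionary of block keys can stay compact. The floor
lookup works because each key sorts after the last term of the previous block and no later
than the first term of its own block.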
