lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring
Date Fri, 13 Oct 2017 16:45:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203829#comment-16203829
] 

Adrien Grand commented on LUCENE-4100:
--------------------------------------

bq. indexsearcher already knows if scores are needed (e.g. from the Sort), but there is no
way to tell it that approximate total hit count is acceptable. If we can do that, then I think
we can make the early termination case really easy for the sorted case, index order case,
and also this maxscore case.

+1 We need to add a new {{boolean needsTotalHits}} to the {{search(Query, int)}} and {{search(Query,
int, Sort)}} methods.

bq. we do? [...] I'm just asking because the new scorer here looks a hell of a lot like a
disjunction scorer

Well it is a disjunction. :) Our regular disjunction scorer maintains a single heap and only
looks at the 'top' element and callso {{updateTop}} after calls to {{nextDoc}} or {{advance}}.
If you want to be able to call {{advance()}} sometimes on low-scoring clauses even if only
{{nextDoc()}} was called on the disjunction, you need to give it the ability to leave some
scorers behind, as long as the sum of the max scores of scorers that are behind and scorers
that are positioned on the current candidate is not greater than the minimum competitive score
(otherwise you might be missing matches). This means you need to move scorers between at least
2 data-structures. In practice, we actually use 3 of them so that we can also easily differenciate
scorers that are on the current candidate from scorers that are too advanced. Moving scorers
between those data-structures has some overhead. For instance here is what I get when I benchmark
the MaxScoreScorer against master's DisjunctionSumScorer (BS1 is disabled in both cases):

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct
diff
   HighTermDayOfYearSort       33.34      (7.3%)       16.55      (2.6%)  -50.4% ( -56% -
 -43%)
              OrHighHigh       21.91      (4.1%)       12.36      (1.7%)  -43.6% ( -47% -
 -39%)
               OrHighLow       48.08      (4.0%)       29.27      (2.4%)  -39.1% ( -43% -
 -34%)
               OrHighMed       60.66      (4.0%)       39.32      (2.5%)  -35.2% ( -40% -
 -29%)
                  Fuzzy2      117.30      (7.1%)      101.36      (7.5%)  -13.6% ( -26% -
   1%)
                  Fuzzy1      238.83     (10.7%)      212.80      (7.3%)  -10.9% ( -26% -
   7%)
            OrHighNotLow       70.64      (3.4%)       66.82      (2.0%)   -5.4% ( -10% -
   0%)
            OrHighNotMed       65.59      (3.0%)       62.63      (1.6%)   -4.5% (  -8% -
   0%)
           OrNotHighHigh       34.55      (2.3%)       33.24      (1.1%)   -3.8% (  -6% -
   0%)
           OrHighNotHigh       48.98      (2.4%)       47.17      (1.4%)   -3.7% (  -7% -
   0%)
            OrNotHighMed      107.24      (2.1%)      103.40      (1.1%)   -3.6% (  -6% -
   0%)
                  IntNRQ       26.26     (11.5%)       25.64     (12.3%)   -2.3% ( -23% -
  24%)
              AndHighLow      879.33      (3.7%)      860.16      (3.3%)   -2.2% (  -8% -
   4%)
              AndHighMed      168.90      (1.7%)      165.48      (1.3%)   -2.0% (  -4% -
   1%)
              HighPhrase       20.08      (2.8%)       19.69      (2.7%)   -2.0% (  -7% -
   3%)
         MedSloppyPhrase       15.44      (1.7%)       15.15      (1.8%)   -1.9% (  -5% -
   1%)
             LowSpanNear       44.70      (2.1%)       43.88      (1.9%)   -1.8% (  -5% -
   2%)
               MedPhrase       52.07      (3.2%)       51.16      (3.1%)   -1.7% (  -7% -
   4%)
         LowSloppyPhrase      150.90      (1.5%)      148.62      (1.5%)   -1.5% (  -4% -
   1%)
            OrNotHighLow     1174.47      (3.6%)     1157.52      (4.1%)   -1.4% (  -8% -
   6%)
            HighSpanNear       38.46      (3.0%)       37.92      (2.4%)   -1.4% (  -6% -
   4%)
        HighSloppyPhrase       28.13      (2.4%)       27.76      (2.5%)   -1.3% (  -6% -
   3%)
               LowPhrase      131.67      (1.6%)      130.37      (2.0%)   -1.0% (  -4% -
   2%)
             MedSpanNear       54.75      (3.4%)       54.22      (3.0%)   -1.0% (  -7% -
   5%)
                Wildcard       41.58      (5.1%)       41.23      (5.2%)   -0.8% ( -10% -
   9%)
             AndHighHigh       71.08      (1.5%)       70.61      (1.3%)   -0.7% (  -3% -
   2%)
                 Prefix3       88.11      (6.4%)       87.57      (7.5%)   -0.6% ( -13% -
  14%)
                 MedTerm      235.73      (1.4%)      235.35      (3.7%)   -0.2% (  -5% -
   4%)
                HighTerm      144.06      (1.3%)      143.93      (4.2%)   -0.1% (  -5% -
   5%)
                 LowTerm      711.49      (4.2%)      718.15      (4.7%)    0.9% (  -7% -
  10%)
                 Respell      203.69      (8.6%)      206.72      (7.0%)    1.5% ( -12% -
  18%)
       HighTermMonthSort      196.30      (5.2%)      222.54      (8.2%)   13.4% (   0% -
  28%)
{noformat}

bq.  I do like that in your patch AssertingScorer checks all that stuff, but there it is a
bit confusing.

Can you elaborate on what you find confusing? This looks similar to how you should not call
{{Score.score()}} if you passed {{needsScores=false}} to me?

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first
published in the IR domain in 1995 by H. Turtle & J. Flood, that I find deserves more
attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries
and lucenebench, the package of Mike McCandless, resulting in very significant speedups.
> This ticket is to get started the discussion on including the implementation into Lucene's
codebase. Because the technique requires awareness about it from the Lucene user/developer,
it seems best to become a contrib/module package so that it consciously can be chosen to be
used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message