lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4100) Maxscore - Efficient Scoring
Date Tue, 10 Oct 2017 09:38:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4100:
---------------------------------
    Attachment: LUCENE-4100.patch

I have been exploring how we could make this optimization a bit less inconvenient by not
requiring that indices be static and not requiring codec-level changes, even though this
might make the optimization a bit less efficient. Here is a patch that demonstrates the idea:
 - {{SimScorer.maxScore()}} returns the maximum score that may be produced by SimScorer.score().
The BM25 impl returns {{IDF * (k1 + 1)}}. POSITIVE_INFINITY is a fine return value in case
a SimScorer doesn't have an upper bound on the produced scores.
 - {{Scorer.maxScore()}} returns the maximum score that may be produced by Scorer.score().
For TermScorer, this method delegates to SimScorer.maxScore().
 - {{Scorer.setMinCompetitiveScore(float)}} is used to tell the scorer that documents whose
scores are less than the given score _may_ safely be ignored. This method is called by
TopScoreDocCollector whenever the top of the priority queue is updated. Scorer impls are free
to ignore it (a rough sketch of how a collector might drive these hooks follows this list).
 - There is a new {{MaxScoreScorer}}, which only works on pure disjunctions and is largely
inspired by {{MinShouldMatchSumScorer}}, except that instead of ensuring that freq >=
minShouldMatch, it tries to ensure that sum_of_max_scores >= minCompetitiveScore.
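
To make the shape of these hooks concrete, here is a minimal, illustrative sketch under the
assumptions above; apart from {{maxScore()}} and {{setMinCompetitiveScore(float)}}, every name
in it ({{SketchScorer}}, {{bm25MaxScore}}, {{collect}}) is made up for this example and is not
part of the patch:

{noformat}
import java.io.IOException;
import java.util.PriorityQueue;

public class MaxScoreSketch {

  /** Minimal stand-in for a scorer over one posting list; not Lucene's Scorer. */
  interface SketchScorer {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc() throws IOException;      // advance to the next matching doc
    float score() throws IOException;      // score of the current doc
    float maxScore();                      // upper bound on score()
    void setMinCompetitiveScore(float s);  // hint only; impls may ignore it
  }

  /**
   * BM25's per-term contribution idf * freq * (k1 + 1) / (freq + k1 * norm)
   * grows with freq and approaches idf * (k1 + 1), hence that upper bound.
   */
  static float bm25MaxScore(float idf, float k1) {
    return idf * (k1 + 1);
  }

  /** Top-N collection loop: once the queue is full, publish the score to beat. */
  static void collect(SketchScorer scorer, int numHits) throws IOException {
    PriorityQueue<Float> topScores = new PriorityQueue<>(); // min-heap of the best N scores
    for (int doc = scorer.nextDoc(); doc != SketchScorer.NO_MORE_DOCS; doc = scorer.nextDoc()) {
      float score = scorer.score();
      if (topScores.size() < numHits) {
        topScores.add(score);
      } else if (score > topScores.peek()) {
        topScores.poll();
        topScores.add(score);
      }
      if (topScores.size() == numHits) {
        // The lowest of the current top-N scores is the new minimum competitive score.
        scorer.setMinCompetitiveScore(topScores.peek());
      }
    }
  }
}
{noformat}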

It is largely untested, but luceneutil finds the same top docs with and without the patch,
so I expect it not to be completely broken (I had to disable the check that hit counts are
the same since Maxscore doesn't give total hit counts). It yields interesting speedups on
wikimedium10M, even for {{OrHighHigh}}:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
   HighTermDayOfYearSort      102.75      (4.8%)       57.26      (3.7%)  -44.3% ( -50% -  -37%)
                 LowTerm      708.14      (5.0%)      696.51      (5.4%)   -1.6% ( -11% -    9%)
              HighPhrase       68.06      (4.0%)       67.00      (3.7%)   -1.6% (  -8% -    6%)
            HighSpanNear        2.84      (3.4%)        2.80      (3.6%)   -1.3% (  -7% -    5%)
             MedSpanNear       24.12      (2.5%)       23.90      (2.6%)   -0.9% (  -5% -    4%)
               MedPhrase       49.14      (2.8%)       48.81      (2.8%)   -0.7% (  -6% -    5%)
             LowSpanNear       20.48      (2.8%)       20.36      (2.9%)   -0.6% (  -6% -    5%)
               LowPhrase       24.11      (2.6%)       24.01      (2.5%)   -0.4% (  -5% -    4%)
              AndHighMed      316.86      (2.7%)      315.66      (2.7%)   -0.4% (  -5% -    5%)
                 MedTerm      334.25      (3.1%)      333.01      (3.2%)   -0.4% (  -6% -    6%)
                 Respell      261.62      (3.4%)      260.86      (5.5%)   -0.3% (  -8% -    8%)
             AndHighHigh       62.76      (2.2%)       62.59      (2.4%)   -0.3% (  -4% -    4%)
                HighTerm       86.17      (3.9%)       86.24      (2.8%)    0.1% (  -6% -    6%)
                  IntNRQ       15.03      (6.5%)       15.20      (5.2%)    1.1% (  -9% -   13%)
         LowSloppyPhrase       99.72      (3.3%)      100.83      (3.3%)    1.1% (  -5% -    8%)
                 Prefix3       34.06      (7.2%)       34.48      (5.0%)    1.3% ( -10% -   14%)
                Wildcard      159.32      (8.4%)      161.37      (6.8%)    1.3% ( -12% -   18%)
         MedSloppyPhrase       42.17      (3.2%)       42.81      (3.0%)    1.5% (  -4% -    7%)
              AndHighLow     1059.03      (3.6%)     1075.73      (2.9%)    1.6% (  -4% -    8%)
            OrNotHighLow     1774.47      (4.1%)     1805.36      (4.7%)    1.7% (  -6% -   10%)
        HighSloppyPhrase       32.84      (4.0%)       33.46      (3.4%)    1.9% (  -5% -    9%)
            OrNotHighMed      295.65      (6.4%)      309.03      (4.7%)    4.5% (  -6% -   16%)
       HighTermMonthSort      227.22      (9.9%)      241.42      (8.7%)    6.2% ( -11% -   27%)
           OrNotHighHigh       43.19      (3.2%)       48.56      (3.1%)   12.4% (   5% -   19%)
            OrHighNotMed       93.61      (3.5%)      113.32      (4.8%)   21.1% (  12% -   30%)
           OrHighNotHigh       13.58      (3.2%)       16.44      (4.9%)   21.1% (  12% -   30%)
                  Fuzzy1      190.81      (5.2%)      236.11     (14.4%)   23.7% (   3% -   45%)
            OrHighNotLow       68.12      (3.9%)       85.99      (6.1%)   26.2% (  15% -   37%)
              OrHighHigh       37.43      (5.4%)       51.96      (5.0%)   38.8% (  26% -   52%)
               OrHighMed       55.45      (5.2%)      130.78      (9.0%)  135.8% ( 115% -  158%)
                  Fuzzy2       24.33      (4.6%)       81.51     (24.2%)  235.0% ( 197% -  276%)
               OrHighLow       32.37      (4.4%)      451.34     (47.8%) 1294.2% (1189% - 1408%)
{noformat}

TODO:
 - docs, tests
 - Add a {{boolean needsTotalHits}} to {{Query.createWeight}} and {{Collector}} similarly
to the existing {{boolean needsScores}}.
 - How should we deal with TopDocs instances that don't have correct total hit counts? Set
the count to {{-1}}? Or maybe {{-1 - num_collected_docs}} so that users can still get a lower
bound on the number of matches? (A small sketch of this encoding follows the list.)
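
For illustration only, a tiny sketch of the {{-1 - num_collected_docs}} idea from the last
item; the method names are hypothetical and nothing here is part of the patch:

{noformat}
// Hypothetical encoding: exact counts stay as-is, early-terminated counts become
// a negative sentinel that still carries a lower bound on the number of matches.
static int encodeTotalHits(boolean exact, int count) {
  // 'count' is the exact hit count if exact, otherwise the number of collected docs.
  return exact ? count : -1 - count;
}

// Recover a lower bound on the number of matches from the encoded value.
static int lowerBoundOfMatches(int encodedTotalHits) {
  return encodedTotalHits >= 0 ? encodedTotalHits : -1 - encodedTotalHits;
}
{noformat}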

Follow-ups:
 - Require that scores are always positive or zero so that such optimizations are easier to
reason about and generalize in the future?
 - Call the {{setMinCompetitiveScore}} API from {{TopFieldDocCollector}} too when the first
sort field is the score.
 - Propagate {{Scorer.setMinCompetitiveScore(float)}} whenever possible so that e.g. filtered
disjunctions could benefit from the optimization too.
 - Implement {{Scorer.setMinCompetitiveScore(float)}} on other scorers.
 - Store more metadata in order to be able to compute tighter upper bounds for the scores,
like the maximum term frequency (see the sketch after this list). Maybe even allow
similarity-specific stats?
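
As an illustration of that last follow-up, a hedged sketch of how a stored maximum term
frequency could tighten the BM25 bound; the helper below is hypothetical, and it assumes the
most favorable length normalization (a norm factor of {{1 - b}}), which is not something the
current patch does:

{noformat}
// Hypothetical: with the per-term maximum frequency available, the BM25 bound
// idf * (k1 + 1) can be tightened to the score of a doc that actually has
// maxTermFreq occurrences and the smallest possible length norm factor (1 - b).
static float bm25TighterMaxScore(float idf, float k1, float b, int maxTermFreq) {
  float minNormFactor = 1 - b; // best case: shortest possible document
  return idf * (maxTermFreq * (k1 + 1)) / (maxTermFreq + k1 * minNormFactor);
}
{noformat}

This value is always less than or equal to {{IDF * (k1 + 1)}} and approaches it as the maximum
term frequency grows.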

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first
> published in the IR domain in 1995 by H. Turtle & J. Flood, which I find deserves more
> attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries
> and lucenebench, Mike McCandless's benchmarking package, and saw very significant speedups.
> This ticket is to start the discussion on including the implementation in Lucene's codebase.
> Because the technique requires awareness from the Lucene user/developer, it seems best to
> ship it as a contrib/module package so that it can be consciously chosen.



