lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8633) Remove term weighting from interval scoring
Date Fri, 11 Jan 2019 10:29:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740265#comment-16740265 ]

Alan Woodward commented on LUCENE-8633:
---------------------------------------

Attached is a patch with an alternative scoring system:
* Sloppy frequency is calculated as the sum of individual interval scores.  Each interval
  is scored as 1/(length - minExtent + 1), where minExtent() is a new method on IntervalsSource
  that exposes the minimum possible length of an interval produced by that source.  This is
  based on the scoring mechanism described in Vigna's paper on intervals [1].
* To keep the score bounded, so that it can be used as a proximity boost without wrecking
  max-score optimizations, the sloppy frequency is converted to a score using a saturation
  function.  I've chosen 5 as the pivot, more or less at random (meaning that a document
  containing 5 intervals of the minimum possible length will get a score of boost * 0.5).
  Better ways of choosing a pivot are welcome.
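
For readers without the patch to hand, the two steps above can be sketched roughly as follows. This is an illustration only, not the code in LUCENE-8633.patch: the class and method names are hypothetical, the way "length" is measured is left to the caller, and the saturation form freq / (freq + pivot) is an assumption that happens to match the "5 intervals of minimum length gives boost * 0.5" example.

```java
/** Hypothetical sketch of the proposed scoring; not taken from LUCENE-8633.patch. */
final class IntervalScoreSketch {

    /** Pivot of 5, the value mentioned in the comment (picked more or less at random). */
    static final float PIVOT = 5f;

    /** Score of a single interval: 1 / (length - minExtent + 1). */
    static float intervalScore(int length, int minExtent) {
        if (length < minExtent) {
            throw new IllegalArgumentException("length must be >= minExtent");
        }
        return 1f / (length - minExtent + 1);
    }

    /** Sloppy frequency: the sum of the individual interval scores in one document. */
    static float sloppyFreq(int[] lengths, int minExtent) {
        float freq = 0f;
        for (int length : lengths) {
            freq += intervalScore(length, minExtent);
        }
        return freq;
    }

    /**
     * Assumed saturation function: boost * freq / (freq + pivot).
     * Bounded above by boost, and equal to boost * 0.5 when freq == pivot.
     */
    static float score(float boost, float sloppyFreq, float pivot) {
        return boost * sloppyFreq / (sloppyFreq + pivot);
    }
}
```

With a source whose minExtent is 2, five intervals of length 2 each score 1, so the sloppy frequency is 5 and score(boost, 5, PIVOT) comes out at boost * 0.5. A single interval of length 3 would contribute only 0.5 to the sloppy frequency.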

[1] http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf

> Remove term weighting from interval scoring
> -------------------------------------------
>
>                 Key: LUCENE-8633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8633.patch
>
>
> IntervalScorer currently uses the same scoring mechanism as SpanScorer, summing the IDF
> of all possibly matching terms from its parent IntervalsSource and using that in conjunction
> with a sloppy frequency to produce a similarity-based score.  This doesn't really make sense,
> however, as it means that terms that don't appear in a document can still contribute to the
> score, and appears to make scores from interval queries comparable with scores from term or
> phrase queries when they really aren't.
> I'd like to explore a different scoring mechanism for intervals, based purely on sloppy
> frequency and ignoring term weighting.  This should make the scores easier to reason about,
> as well as making them useful for things like proximity boosting on boolean queries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

