lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6819) Deprecate index-time boosts?
Date Thu, 16 Feb 2017 19:27:42 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870559#comment-15870559 ]

Adrien Grand commented on LUCENE-6819:
--------------------------------------

I agree that index-time and search-time boosting have different trade-offs, and that both may
be useful. The problem I have is that supporting index-time boosts makes the length norm less
accurate for _everyone_. Right now, if you do not use index-time boosts, which I think is the
case for the majority of users, you end up with a length norm between 0 and 1 ({{1/sqrt(fieldLen)}}).
The length norm can only be greater than 1 if you use a boost that is greater than 1. Out
of the 256 values that {{SmallFloat.byte315ToFloat}} supports, only 125 are less than
or equal to 1; the other 131 are all greater than 1. In other words, more than half of
the norm values we support are wasted if you do not use index-time boosts.
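
The 125/131 split can be checked by decoding all 256 byte values. The sketch below is a standalone re-implementation modeled on the decoding logic of {{SmallFloat.byte315ToFloat}}, so it runs without Lucene on the classpath; the class name {{Byte315Range}} and the helper {{countAtMostOne}} are made up for illustration.

```java
// Standalone sketch: decodes every possible norm byte and counts how many
// decoded values are <= 1. Byte315Range and countAtMostOne are hypothetical
// names; the decoding is assumed to mirror SmallFloat.byte315ToFloat.
final class Byte315Range {

    // Byte 0 maps to 0.0f; any other byte is shifted into the top of the
    // float bit pattern and the exponent is re-biased.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Number of byte values whose decoded float is at most 1.
    static int countAtMostOne() {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            if (byte315ToFloat((byte) i) <= 1.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int atMostOne = countAtMostOne();
        System.out.println("values <= 1: " + atMostOne);
        System.out.println("values >  1: " + (256 - atMostOne));
    }
}
```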

If instead we could assume that norms are always between 0 and 1, we could take one bit from
the exponent and spend it on the mantissa to improve accuracy. For instance, I rebuilt
the table that had been built for LUCENE-5005 and expanded it with a few more length values,
as well as what the rounded norm would be if we spent 1 more bit on the mantissa (while still
being able to encode the norm in a single byte; see the float415 column):

||numTerms||1/sqrt(numTerms)||1/sqrt(numTerms) to float315||1/sqrt(numTerms) to float415||
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 0.70710677 | 0.625 | 0.6875 |
| 3 | 0.57735026 | 0.5 | 0.5625 |
| 4 | 0.5 | 0.5 | 0.5 |
| 5 | 0.4472136 | 0.4375 | 0.4375 |
| 6 | 0.4082483 | 0.375 | 0.40625 |
| 7 | 0.37796447 | 0.375 | 0.375 |
| 8 | 0.35355338 | 0.3125 | 0.34375 |
| 9 | 0.33333334 | 0.3125 | 0.3125 |
| 10 | 0.31622776 | 0.3125 | 0.3125 |
| 11 | 0.30151135 | 0.25 | 0.28125 |
| 12 | 0.28867513 | 0.25 | 0.28125 |
| 13 | 0.2773501 | 0.25 | 0.25 |
| 14 | 0.26726124 | 0.25 | 0.25 |
| 15 | 0.2581989 | 0.25 | 0.25 |
| 16 | 0.25 | 0.25 | 0.25 |
| 17 | 0.24253562 | 0.21875 | 0.234375 |
| 18 | 0.23570226 | 0.21875 | 0.234375 |
| 19 | 0.22941573 | 0.21875 | 0.21875 |
| 20 | 0.2236068 | 0.21875 | 0.21875 |
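
The two rounded columns of the table above can be regenerated with a small standalone program. This is only a sketch: the encode/decode pair is modeled on the generic {{SmallFloat.floatToByte}} / {{SmallFloat.byteToFloat}} logic, and it assumes that "float415" means {{numMantissaBits=4}} with {{zeroExp=15}}, which is inferred from the column name rather than stated in this comment. The class {{NormRounding}} and its helpers {{round}} and {{distinctNorms}} are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Standalone sketch of single-byte float encoding, parameterized the same way
// as the float315 naming. NormRounding, round and distinctNorms are
// hypothetical; float415 = (4, 15) is an assumption.
final class NormRounding {

    // Encode by truncating the float bit pattern (no round-to-nearest).
    static byte floatToByte(float f, int numMantissaBits, int zeroExp) {
        int fzero = (63 - zeroExp) << numMantissaBits;
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - numMantissaBits);
        if (small <= fzero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (small >= fzero + 0x100) {
            return (byte) -1; // overflow saturates at the largest byte
        }
        return (byte) (small - fzero);
    }

    static float byteToFloat(byte b, int numMantissaBits, int zeroExp) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - numMantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    // 1/sqrt(numTerms) after a round trip through the single-byte encoding.
    static float round(int numTerms, int numMantissaBits, int zeroExp) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        byte encoded = floatToByte(norm, numMantissaBits, zeroExp);
        return byteToFloat(encoded, numMantissaBits, zeroExp);
    }

    // How many distinct rounded norms exist for field lengths 1..maxLen.
    static int distinctNorms(int maxLen, int numMantissaBits, int zeroExp) {
        Set<Float> seen = new HashSet<>();
        for (int n = 1; n <= maxLen; n++) {
            seen.add(round(n, numMantissaBits, zeroExp));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("||numTerms||1/sqrt(numTerms)||float315||float415||");
        for (int n = 1; n <= 20; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println("| " + n + " | " + exact + " | "
                    + round(n, 3, 15) + " | " + round(n, 4, 15) + " |");
        }
    }
}
```

Because the encoding truncates, each rounded value is the largest representable value that does not exceed the exact norm, which is why 0.70710677 maps to 0.625 rather than 0.75 in the float315 column.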

Something I really like about it is that for all length values between 1 and 9 inclusive, you
get different values for the rounded norms. I have seen several users asking why "A B C D"
would score as well as "A B C" when the query is e.g. "A", in spite of being longer. If
we could get this addressed for short fields (think e.g. product names), I think that would
be a great win.

> Deprecate index-time boosts?
> ----------------------------
>
>                 Key: LUCENE-6819
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6819
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Follow-up of this comment: https://issues.apache.org/jira/browse/LUCENE-6818?focusedCommentId=14934801&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14934801
> Index-time boosts are a very expert feature whose behaviour is tied to the Similarity
> impl. Additionally, users have often been confused by the poor precision due to the fact that
> we encode values on a single byte. But now we have doc values, which allow you to encode any
> values the way you want with as much precision as you need, so maybe we should deprecate index-time
> boosts and recommend encoding index-time scoring factors into doc values fields instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

