lucene-dev mailing list archives

From Koji Sekiguchi
Subject Re: is omitNorms still valid?
Date Wed, 23 Aug 2017 01:41:25 GMT
Hi Adrien,

Thank you for the great explanation!


On 2017/08/22 19:36, Adrien Grand wrote:
> Yes, LUCENE-7730 is the issue.
> On Tue, Aug 22, 2017 at 12:00, Koji Sekiguchi wrote:
>     I thought LUCENE-6819 removed the single-byte float as well because, to describe the
>     background of the ticket, you mentioned its poor precision. So I thought the ticket
>     solved it (from the context).
>     So the field length is still stored in a single byte and the precision of the float
>     is still not good? And the point of LUCENE-6819 is that we can set a more precise
>     boost value if we want, because it no longer depends on the poor-precision single
>     byte used for the field length?
>
> We still use a single byte in order to store the norm. The difference is that before we
> used to store ${index-boost} * ${length-norm}. Because index-boosts could take any
> positive value, we could not make any assumptions about this quantity that could have
> helped make storage more [...]
>
> More concretely, length-norm was always between 0 and 1, so if, like most Lucene users,
> you did not use index boosts, then the final normalization factor would be in 0-1 as
> well. Yet only 125 of the 256 byte values in the SmallFloat encoding that we used
> represent values between 0 and 1. So this feature was trading away accuracy of the
> length normalization factor in favor of a feature that was only used by a minority and
> could easily be replaced by a doc-value field.
>
> We actually went a bit further and started storing the document length rather than the
> length-normalization factor in the norms field. It is easier to reason about since we
> know all values are integers, positive, and that we want to have better accuracy for
> lower values. [...] allowed to encode lengths accurately up to 40, while the previous
> encoding that we used considered 3 and 4 to be the same length, for instance. Then
> accuracy degrades progressively, as you can see on the LUCENE-7730 ticket.
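
For anyone who wants to check the numbers quoted above, here is a minimal, self-contained Java sketch. It does not call Lucene: `floatToByte315`, `byte315ToFloat`, `longToInt4` and `intToByte4` are re-implemented here from my reading of `org.apache.lucene.util.SmallFloat`, so treat them as an approximation of the real code. The class name `NormEncodingDemo`, the helper methods, and the use of 1/sqrt(length) as the old length norm are assumptions made for illustration.

```java
// Sketch only: hypothetical re-implementation, not the actual Lucene source.
final class NormEncodingDemo {

    // Legacy float-to-byte encoder (assumed to mirror SmallFloat.floatToByte315).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // Zero and negatives map to 0, tiny positives to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable code
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Legacy byte-to-float decoder (assumed to mirror SmallFloat.byte315ToFloat).
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // How many of the 256 byte codes decode to a value in [0, 1]?
    static int countValuesInUnitInterval() {
        int count = 0;
        for (int b = 0; b < 256; b++) {
            float v = byte315ToFloat((byte) b);
            if (v >= 0f && v <= 1f) {
                count++;
            }
        }
        return count;
    }

    // Assumption: without index boosts the old norm was 1/sqrt(length).
    static byte legacyNorm(int length) {
        return floatToByte315((float) (1.0 / Math.sqrt(length)));
    }

    // Newer integer encoding (assumed to mirror SmallFloat.longToInt4):
    // small values are stored as-is, larger ones keep their top 4 bits plus a shift.
    static int longToInt4(long i) {
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i;
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift) & 0x07; // drop the implicit leading bit
        return encoded | ((shift + 1) << 3);
    }

    // Byte codes left over once the largest int is representable; these are
    // spent on storing the smallest lengths exactly.
    static final int NUM_FREE_VALUES = 255 - longToInt4(Integer.MAX_VALUE);

    // Assumed to mirror SmallFloat.intToByte4.
    static byte intToByte4(int i) {
        if (i < 0) {
            throw new IllegalArgumentException("Only supports positive values, got " + i);
        }
        if (i < NUM_FREE_VALUES) {
            return (byte) i;
        }
        return (byte) (NUM_FREE_VALUES + longToInt4(i - NUM_FREE_VALUES));
    }

    // Largest length L such that every length in 0..L gets its own byte code.
    static int largestExactLength() {
        int n = 1;
        while (intToByte4(n) != intToByte4(n - 1)) {
            n++;
        }
        return n - 1;
    }

    public static void main(String[] args) {
        System.out.println("codes in [0,1]: " + countValuesInUnitInterval());
        System.out.println("legacy byte for length 3: " + legacyNorm(3));
        System.out.println("legacy byte for length 4: " + legacyNorm(4));
        System.out.println("largest exact length: " + largestExactLength());
    }
}
```

With these re-implementations, 125 of the 256 byte codes decode into [0, 1], lengths 3 and 4 share one legacy byte, and the integer scheme keeps lengths exact up to 40 (41 is the first length that shares a byte with its predecessor), which agrees with the explanation quoted above.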
