lucene-dev mailing list archives

From Koji Sekiguchi <koji.sekigu...@rondhuit.com>
Subject Re: is omitNorms still valid?
Date Wed, 23 Aug 2017 01:41:25 GMT
Hi Adrien,

Thank you for the great explanation!

Koji


On 2017/08/22 19:36, Adrien Grand wrote:
> Yes, LUCENE-7730 is the issue.
> 
> On Tue, Aug 22, 2017 at 12:00, Koji Sekiguchi <koji.sekiguchi@rondhuit.com> wrote:
> 
>     I thought LUCENE-6819 removed the single-byte float as well, because when describing the
>     background of the ticket you mentioned its poor precision. So I assumed (from the context)
>     that the ticket had solved that.
> 
>     So the field length is still stored in a single byte, and the precision of the float is
>     still not good? And the point of LUCENE-6819 is that we can set a more precise boost value
>     if we want, because it no longer depends on the poor-precision single byte used for the
>     field length?
> 
> 
> We still use a single byte to store the norm. The difference is that we used to store
> ${index-boost} * ${length-norm}. Because index boosts could take any positive value, we
> could not make any assumptions about this quantity that would have helped make storage more
> efficient. More concretely, length-norm was always between 0 and 1, so if you did not use
> index boosts, like most Lucene users, the final normalization factor would be between 0 and 1
> as well. Yet only 125 of the 256 byte values in the SmallFloat encoding that we used
> represent values between 0 and 1. So this feature was trading away accuracy of the length
> normalization factor for the sake of a feature that was only used by a minority and could
> easily be replaced by a doc-value field.
> 
> We actually went a bit further and started storing the document length rather than the
> precomputed length-normalization factor in the norms field. It is easier to reason about,
> since we know all values are integers, positive, and that we want better accuracy for lower
> values. This allowed us to encode lengths accurately up to 40, whereas the previous encoding
> considered 3 and 4 to be the same length, for instance. Beyond that, accuracy degrades
> progressively, as you can see on the LUCENE-7730 ticket.
> 
> 
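
The two encodings described in the quoted explanation can be compared with a small
sketch. It is written in Python for brevity and is an illustrative re-implementation, not
the Lucene source. The function names mirror the methods of
org.apache.lucene.util.SmallFloat (floatToByte315, byte315ToFloat, intToByte4, byte4ToInt),
but the bodies should be checked against that class before relying on exact byte values.
It also assumes the old length norm was 1/sqrt(length) with no index boost applied.

```python
import math
import struct

# --- Old scheme (sketch): 1/sqrt(length) squeezed into one byte -------------

def float_to_byte315(f: float) -> int:
    """Keep the top bits of the IEEE-754 single-precision pattern, rebased to a byte."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # signed 32-bit view
    small = bits >> 21
    zero = (63 - 15) << 3
    if small <= zero:
        return 0 if bits <= 0 else 1  # underflow: smallest positive code
    if small >= zero + 0x100:
        return 255                    # overflow: saturate
    return small - zero

def byte315_to_float(b: int) -> float:
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def old_norm_byte(length: int) -> int:
    # Assumption: length norm = 1/sqrt(length), no index boost.
    return float_to_byte315(1.0 / math.sqrt(length))

# --- New scheme (sketch): the document length itself in one byte ------------

def long_to_int4(i: int) -> int:
    num_bits = i.bit_length()
    if num_bits < 4:
        return i                       # small values are stored verbatim
    shift = num_bits - 4
    encoded = (i >> shift) & 0x07      # 3 bits below the implicit leading 1
    return encoded | ((shift + 1) << 3)

def int4_to_long(i: int) -> int:
    bits = i & 0x07
    shift = (i >> 3) - 1
    return bits if shift == -1 else (bits | 0x08) << shift

MAX_INT4 = long_to_int4(2**31 - 1)
NUM_FREE_VALUES = 255 - MAX_INT4       # codes left over, used for exact small values

def int_to_byte4(i: int) -> int:
    if i < 0:
        raise ValueError("only non-negative lengths are supported")
    if i < NUM_FREE_VALUES:
        return i
    return NUM_FREE_VALUES + long_to_int4(i - NUM_FREE_VALUES)

def byte4_to_int(b: int) -> int:
    if b < NUM_FREE_VALUES:
        return b
    return NUM_FREE_VALUES + int4_to_long(b - NUM_FREE_VALUES)

if __name__ == "__main__":
    in_unit_range = sum(1 for b in range(256) if byte315_to_float(b) <= 1.0)
    print("old codes decoding to [0, 1]:", in_unit_range)
    print("old codes for lengths 3 and 4:", old_norm_byte(3), old_norm_byte(4))
    exact = [n for n in range(1, 100) if byte4_to_int(int_to_byte4(n)) == n]
    print("lengths the new scheme round-trips exactly:", exact)
```

Under these assumptions the sketch reproduces the three figures from the explanation: 125
old codes decode into [0, 1], lengths 3 and 4 share one old code, and the new scheme
round-trips every length up to 40 before it starts rounding down.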


