lucene-dev mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: is omitNorms still valid?
Date Tue, 22 Aug 2017 10:36:05 GMT
Yes, LUCENE-7730 is the issue.

On Tue, 22 Aug 2017 at 12:00, Koji Sekiguchi <koji.sekiguchi@rondhuit.com>
wrote:

> I thought LUCENE-6819 removed the single-byte float as well, because when
> you described the background of the ticket you mentioned its poor
> precision. So I assumed, from the context, that the ticket solved it.
>
> So the field length is still stored in a single byte, and the precision
> of the float is still not good? And the point of LUCENE-6819 is that we
> can set a more precise boost value if we want, because it no longer
> depends on the poor-precision single byte used for field length?
>

We still use a single byte to store the norm. The difference is that we
used to store ${index-boost} * ${length-norm}. Because index boosts could
take any positive value, we could not make any assumptions about this
quantity that would have helped make storage more efficient. More
concretely, length-norm was always between 0 and 1, so if you did not use
index boosts, like most Lucene users, the final normalization factor was
between 0 and 1 as well. Yet only 125 of the 256 byte values of the
SmallFloat encoding that we used represent values between 0 and 1. So this
feature was sacrificing accuracy of the length normalization factor for the
sake of something that only a minority used and that could easily be
replaced by a doc-value field.
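
To make the 125-out-of-256 figure concrete, here is a minimal sketch that
decodes every byte value and counts how many land in [0, 1]. It assumes the
old encoding was a tiny float with 3 mantissa bits and a zero-exponent point
of 15; the class and method names are made up for this illustration and are
not Lucene's API.

```java
class SmallFloatRangeSketch {

    // Assumed layout: 3 mantissa bits, 5 exponent bits, zero-exponent point 15.
    // Byte 0 is reserved for 0.0f; every other byte maps to a positive float.
    static float decode315(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);   // move the 8 bits into float position
        bits += (63 - 15) << 24;             // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }

    // Counts the byte values whose decoded float lies in [0, 1].
    static int countInUnitInterval() {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            float v = decode315((byte) i);
            if (v >= 0.0f && v <= 1.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countInUnitInterval()
                + " of 256 byte values decode to [0, 1]");
    }
}
```

Under that assumed layout, the remaining byte values are spent on numbers
greater than 1, which only matter when index boosts are in play.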

We actually went a bit further and started storing the document length
rather than the precomputed length-normalization factor in the norms field.
This is easier to reason about since we know that all values are positive
integers and that we want better accuracy for lower values. This allowed us
to encode lengths accurately up to 40, whereas the previous encoding
considered 3 and 4 to be the same length, for instance. Beyond that,
accuracy degrades progressively, as you can see on the LUCENE-7730 ticket.
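
For illustration, here is a rough sketch of how a length can be packed into
one byte so that small values stay exact and larger ones lose precision
gradually. It is written from memory of the general approach (a 4-bit
mantissa with the unused byte values reserved for exact small lengths); the
names and constants are assumptions for this example and not necessarily the
code that was committed.

```java
class LengthByteSketch {

    // Encodes a non-negative long with a 4-bit mantissa (top bit implicit).
    static int toInt4(long i) {
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i;                    // small values are stored as-is
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift);     // keep the 4 most significant bits
        encoded &= 0x07;                       // drop the implicit leading bit
        encoded |= (shift + 1) << 3;           // store the shift; 0 means "as-is"
        return encoded;
    }

    static long fromInt4(int i) {
        long bits = i & 0x07;
        int shift = (i >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    // Byte values not needed to cover Integer.MAX_VALUE hold exact lengths.
    static final int FREE_VALUES = 255 - toInt4(Integer.MAX_VALUE);

    static byte encodeLength(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0");
        }
        if (length < FREE_VALUES) {
            return (byte) length;
        }
        return (byte) (FREE_VALUES + toInt4(length - FREE_VALUES));
    }

    static int decodeLength(byte b) {
        int i = b & 0xff;
        if (i < FREE_VALUES) {
            return i;
        }
        return (int) (FREE_VALUES + fromInt4(i - FREE_VALUES));
    }

    public static void main(String[] args) {
        for (int len : new int[] {3, 4, 40, 41, 100, 1000}) {
            System.out.println(len + " -> " + decodeLength(encodeLength(len)));
        }
    }
}
```

In this sketch every length from 0 to 40 round-trips exactly, and longer
lengths are rounded down to the nearest representable value.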
