lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
Date Thu, 19 Oct 2017 16:37:00 GMT


Robert Muir commented on LUCENE-8000:

and just to iterate a bit more on why position count can be a can of worms: it means Lucene
would behave differently/inconsistently depending on the language in many cases (or even on
minor encoding differences). Some languages may inflect a word to make it plural, and a stemmer
strips the suffix. Others might use a postposition that gets removed by the stopfilter, etc.

Today this is all consistent either way, since neither suffixes stripped by stemmers, nor stopwords,
nor artificial synonyms count towards the length. So we measure length based on the "important
content" according to the user's selected analyzer.

The avg document length calculation is just an approximation for a pivot value, and that same
pivot is used for *all documents*. Because of that, I don't think there will be huge wins
in trying to be pedantic about how its exact value is computed. It will never be exact anyway, since
individual documents' lengths are truncated to a single byte and the average wouldn't reflect
such truncation. Nevertheless, it's a protected method, so you can override the implementation
if you don't trust that it works and want to do something different.
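To illustrate the truncation point: per-document lengths are stored in a single byte, so only a few leading bits of the length survive, and no choice of average can undo that loss. The toy quantizer below is an illustration only (it is not Lucene's actual SmallFloat encoding); it keeps the top four significant bits of the length and zeros the rest:

```java
public class LengthQuantizer {
    // Toy lossy encoder: keep the leading four significant bits of the
    // length and zero out the rest, mimicking the precision loss that any
    // single-byte encoding of a document length must accept.
    static int quantize(int length) {
        int bits = 32 - Integer.numberOfLeadingZeros(length); // significant bits
        int shift = Math.max(0, bits - 4);
        return (length >> shift) << shift; // low-order bits are lost
    }

    public static void main(String[] args) {
        System.out.println(quantize(7));    // small lengths survive exactly: 7
        System.out.println(quantize(17));   // 17 -> 16
        System.out.println(quantize(1000)); // 1000 -> 960
    }
}
```

Once lengths of 961..1023 all decode to the same value, a "pedantically exact" average offers little over the approximation.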

> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>                 Key: LUCENE-8000
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
> Length of individual documents only counts the number of positions of a document since
discountOverlaps defaults to true.
>  {quote} @Override
>   public final long computeNorm(FieldInvertState state) {
>     final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }{quote}
> Measuring document length this way seems perfectly OK to me. What bothers me is that
> average document length is based on sumTotalTermFreq for a field. As far as I understand,
> that sums up the totalTermFreqs of all terms of a field, therefore counting positions of terms
> including those that overlap.
> {quote}  protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f;       // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }{quote}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious effect. It
> just means that documents that have synonyms, or in our case different normal forms of tokens
> on the same position, are shorter and therefore get higher scores than they should, and that
> we do not use the whole spectrum of relative document length in BM25.
> I think for BM25, discountOverlaps should default to false.
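The mismatch described in the issue can be sketched with plain arithmetic (a toy example with made-up numbers, not Lucene code): with discountOverlaps enabled, a document carrying synonyms reports a shorter length than the position count that feeds sumTotalTermFreq, so the pivot sits above the average of the lengths actually stored.

```java
public class PivotMismatchDemo {
    // Per-document length as computeNorm sees it with discountOverlaps = true.
    static int discountedLength(int positions, int overlaps) {
        return positions - overlaps;
    }

    // avgFieldLength as currently computed: sumTotalTermFreq / docCount,
    // where sumTotalTermFreq counts every term occurrence, overlaps included.
    static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    public static void main(String[] args) {
        // Two hypothetical docs, 10 positions each; doc 0 has 4 synonym overlaps.
        int len0 = discountedLength(10, 4);  // 6
        int len1 = discountedLength(10, 0);  // 10
        // sumTotalTermFreq = 10 + 10 = 20, so the pivot is 10.0 ...
        double pivot = avgFieldLength(20, 2);
        // ... while the average of the stored (discounted) lengths is 8.0,
        // so the synonym-heavy doc 0 looks shorter relative to the pivot.
        double discountedAvg = (len0 + len1) / 2.0;
        System.out.println(pivot + " vs " + discountedAvg);
    }
}
```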

This message was sent by Atlassian JIRA

