lucene-dev mailing list archives

From "Christoph Goller (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
Date Thu, 19 Oct 2017 15:14:00 GMT
Christoph Goller created LUCENE-8000:
----------------------------------------

             Summary: Document Length Normalization in BM25Similarity correct?
                 Key: LUCENE-8000
                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Christoph Goller
            Priority: Minor


The length of an individual document counts only the number of positions in the document, because
discountOverlaps defaults to true.

{quote}
  @Override
  public final long computeNorm(FieldInvertState state) {
    final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
    int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
    if (indexCreatedVersionMajor >= 7) {
      return SmallFloat.intToByte4(numTerms);
    } else {
      return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
    }
  }
{quote}

Measuring document length this way seems perfectly OK to me. What bothers me is that the
average document length is based on sumTotalTermFreq for a field. As far as I understand, that
sums up the totalTermFreq of all terms of a field, and therefore counts the positions of terms
including those that overlap.

{quote}
  protected float avgFieldLength(CollectionStatistics collectionStats) {
    final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    if (sumTotalTermFreq <= 0) {
      return 1f;       // field does not exist, or stat is unsupported
    } else {
      final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
      return (float) (sumTotalTermFreq / (double) docCount);
    }
  }
{quote}
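
To make the two counts concrete, here is a minimal sketch of how a single document would be counted under each rule. It is not Lucene code: the class and method names are hypothetical, a document is modeled simply as an array of position increments, and it assumes (as described above) that term frequencies count every token, including those with a position increment of 0.

```java
/**
 * Toy model of how one document's tokens are counted.
 * Hypothetical names; this only mimics the roles of
 * state.getLength() and state.getNumOverlap().
 */
final class OverlapCountSketch {

    /** Total number of tokens, overlapping or not (the role of state.getLength()). */
    static int length(int[] positionIncrements) {
        return positionIncrements.length;
    }

    /** Tokens with position increment 0 (the role of state.getNumOverlap()). */
    static int numOverlap(int[] positionIncrements) {
        int overlaps = 0;
        for (int inc : positionIncrements) {
            if (inc == 0) {
                overlaps++;
            }
        }
        return overlaps;
    }

    /** Document length as computeNorm would see it for the given setting. */
    static int normLength(int[] positionIncrements, boolean discountOverlaps) {
        int len = length(positionIncrements);
        return discountOverlaps ? len - numOverlap(positionIncrements) : len;
    }

    public static void main(String[] args) {
        // "fast" (+1), synonym "quick" (+0), "car" (+1), synonym "automobile" (+0), "race" (+1)
        int[] doc = {1, 0, 1, 0, 1};
        System.out.println("length used for the norm:     " + normLength(doc, true));
        System.out.println("tokens counted in term freqs: " + length(doc));
    }
}
```

With discountOverlaps enabled, the two numbers differ by exactly the number of overlapping tokens, and that difference is what ends up in the average but not in the per-document length.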

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks, and I am not sure whether this has a serious effect. It just
means that documents that have synonyms or, in our case, different normal forms of tokens at
the same position appear shorter and therefore get higher scores than they should, and that we
do not use the whole spectrum of relative document length in BM25.
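
As a rough illustration of the size of the effect (not a benchmark), the sketch below evaluates the usual BM25 term-frequency normalization, freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl)), with k1 = 1.2 and b = 0.75. The class name and the numbers are made up: it assumes a collection in which every document has 100 positions plus 50 overlapping tokens, and it ignores the lossy byte encoding of the norm.

```java
/**
 * Hypothetical illustration of the length mismatch; not Lucene code.
 */
final class Bm25LengthMismatchSketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** BM25 term-frequency normalization for one term in one document. */
    static double tfNorm(double freq, double docLength, double avgLength) {
        double k = K1 * (1 - B + B * docLength / avgLength);
        return freq * (K1 + 1) / (freq + k);
    }

    public static void main(String[] args) {
        double positions = 100;  // document length with overlaps discounted
        double overlaps = 50;    // synonyms / alternative normal forms at the same positions

        // Every document is identical, so each one is exactly of average length.
        double avgFromPositions = positions;             // average consistent with the norm
        double avgFromTermFreqs = positions + overlaps;  // average derived from sumTotalTermFreq

        System.out.println("consistent lengths: " + tfNorm(1, positions, avgFromPositions));
        System.out.println("mismatched lengths: " + tfNorm(1, positions, avgFromTermFreqs));
    }
}
```

In this made-up collection an average-length document is treated as if it were only two thirds of the average length, so its normalized term frequency comes out higher than in the consistent case.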

I think that for BM25, discountOverlaps should default to false.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

