lucene-solr-user mailing list archives

From Sascha Szott <sz...@zib.de>
Subject field length within BM25 score calculation in Solr 6.3
Date Sun, 04 Dec 2016 23:26:13 GMT
Hi folks,

my Solr index consists of one document with a single-valued field "title" of type "text_general".
The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field type text_general
uses a StandardTokenizer, which should produce 9 tokens, so the length of field title in the
given document should be 9.

The field type is defined as follows:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of the document
for the given query is 0.272. 

But why does Solr 6.3 state that the length of field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
    0.2876821 = idf(docFreq=1, docCount=1)
    0.94664377 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      9.0 = avgFieldLength
      10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.
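
To double-check the explain output, the numbers can be reproduced by hand. The following is a
minimal Python sketch, assuming the usual Lucene-style BM25 formulas, i.e.
idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and
tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)).
The function names are my own and not part of Solr. With fieldLength = 10.24 the sketch
reproduces the reported score, whereas the expected length of 9 would give a tfNorm of 1.0:

```python
import math


def bm25_idf(doc_freq, doc_count):
    """Inverse document frequency as shown in the explain output (assumed formula)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """Term frequency normalization with the k1 and b values from the explain output."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


idf = bm25_idf(doc_freq=1, doc_count=1)

# Score with the field length reported by Solr (10.24)
reported = idf * bm25_tf_norm(1.0, 10.24, 9.0)
# Score with the field length I would have expected (9)
expected = idf * bm25_tf_norm(1.0, 9.0, 9.0)

print(round(reported, 4))  # 0.2723, matches the explain output
print(round(expected, 4))  # 0.2877, i.e. idf * 1.0
```

So the explain output is internally consistent; the only surprising input is the fieldLength itself.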

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
    0.18232156 = idf(docFreq=2, docCount=2)
    0.7757405 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.0 = avgFieldLength
      10.24 = fieldLength

The value of fieldLength does not match 8.

Is there some "magic" applied to the value of field length that goes beyond the standard
BM25 score formula?

If so, what is the idea behind this modification? If not, is this a Lucene / Solr bug?
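
One guess on my side, which I would be glad to have confirmed or corrected: if the field length
is not stored exactly but as a lossy one-byte norm of 1 / sqrt(length), then both 8 and 9 could
decode to the same value. The Python sketch below is my own port of what I understand Lucene's
SmallFloat.floatToByte315 / byte315ToFloat to do (3 mantissa bits, 5 exponent bits); the
function bodies and the helper decoded_field_length are assumptions on my part, not the actual
Lucene code:

```python
import math
import struct


def float_to_byte315(f):
    """Lossy 8-bit encoding of a float (assumed port: 3 mantissa bits, zero exponent 15)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw float32 bit pattern
    small = bits >> (24 - 3)
    if small <= (63 - 15) << 3:
        return 0 if bits <= 0 else 1  # underflow: zero or smallest positive value
    if small >= ((63 - 15) << 3) + 0x100:
        return 255  # overflow: largest representable value
    return small - ((63 - 15) << 3)


def byte315_to_float(b):
    """Inverse of float_to_byte315; the low mantissa bits dropped on encoding stay zero."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def decoded_field_length(num_tokens):
    """Hypothetical helper: encode 1/sqrt(length) into one byte, then decode a length."""
    norm = 1.0 / math.sqrt(num_tokens)
    f = byte315_to_float(float_to_byte315(norm))
    return 1.0 / (f * f)


for n in (4, 8, 9):
    print(n, decoded_field_length(n))
```

Under these assumptions, lengths 8 and 9 both decode to 10.24, while length 4 survives the
round trip unchanged, which would match the fieldLength values in the explain outputs above.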

Best regards,
Sascha




