lucene-solr-user mailing list archives

From Sascha Szott <sz...@zib.de>
Subject Re: field length within BM25 score calculation in Solr 6.3
Date Thu, 15 Dec 2016 20:55:28 GMT
Hi,

bumping my question after 10 days. Any clarification is appreciated.

Best
Sascha


> Hi folks,
>
> my Solr index consists of one document with a single-valued field "title" of type "text_general".
> The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field type text_general
> uses a StandardTokenizer, which should result in 9 tokens. The corresponding length of field
> title in the given document is 9.
>
> The field type is defined as follows:
>
>    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
>
> I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.
>
> As expected, the query title:1 returns the given document. The BM25 score of the document
> for the given query is 0.272.
>
> But why does Solr 6.3 state that the length of field title is 10.24?
>
> 0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
>    0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
>      0.2876821 = idf(docFreq=1, docCount=1)
>      0.94664377 = tfNorm, computed from:
>        1.0 = termFreq=1.0
>        1.2 = parameter k1
>        0.75 = parameter b
>        9.0 = avgFieldLength
>        10.24 = fieldLength
>
> In contrast, the value of avgFieldLength is correct.
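
To make the arithmetic in the explain output easier to check, here is a minimal Python sketch that recomputes the score from the listed values. The formulas for idf and tfNorm are the ones Lucene's BM25Similarity is generally documented to use; the function names `bm25_idf` and `bm25_tf_norm` are made up for this illustration and are not Lucene API.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    length_ratio = field_length / avg_field_length
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * length_ratio))

# Values taken from the explain output: docFreq=1, docCount=1,
# termFreq=1.0, avgFieldLength=9.0, fieldLength=10.24.
idf = bm25_idf(1, 1)
tf_norm_reported = bm25_tf_norm(1.0, 10.24, 9.0)
score_reported = idf * tf_norm_reported

# For comparison: the same formula with the expected field length of 9.
tf_norm_expected = bm25_tf_norm(1.0, 9.0, 9.0)

print(idf, tf_norm_reported, score_reported, tf_norm_expected)
```

With fieldLength = 10.24 this reproduces the reported 0.2723; with fieldLength = 9 the tfNorm would be exactly 1.0. So the score really is computed from 10.24, and the number is not just a display artifact of the explain output.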
>
> The same observation can be made if the index consists of two simple documents:
>
> doc1: title = 1 2 3 4
> doc2: title = 1 2 3 4 5 6 7 8
>
> The BM25 score calculation of doc2 is explained as:
>
> 0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
>    0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
>      0.18232156 = idf(docFreq=2, docCount=2)
>      0.7757405 = tfNorm, computed from:
>        1.0 = termFreq=1.0
>        1.2 = parameter k1
>        0.75 = parameter b
>        6.0 = avgFieldLength
>        10.24 = fieldLength
>
> The value of fieldLength does not match 8.
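
One likely explanation, offered as an assumption rather than a confirmed answer: as far as I understand the Lucene 6.x BM25Similarity, the field length is not stored exactly. It is written at index time as a one-byte norm (1/sqrt(length) passed through SmallFloat.floatToByte315) and decoded again at query time, which loses precision. The sketch below is my own Python re-implementation of that encoding, not Lucene code; it assumes an index-time boost of 1, and `decoded_field_length` is a hypothetical helper name.

```python
import math
import struct

def float_to_byte315(f):
    # Assumed port of Lucene's SmallFloat.floatToByte315: keep only the top
    # bits of the 32-bit float representation (truncating, not rounding).
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21
    fzero = (63 - 15) << 3
    if small <= fzero:
        return 0 if bits <= 0 else 1   # underflow
    if small >= fzero + 0x100:
        return 255                     # overflow
    return small - fzero

def byte315_to_float(b):
    # Assumed port of SmallFloat.byte315ToFloat (inverse of the above).
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def decoded_field_length(num_tokens):
    # Encode 1/sqrt(length) into one byte, decode it, and invert again.
    norm = 1.0 / math.sqrt(num_tokens)
    f = byte315_to_float(float_to_byte315(norm))
    return 1.0 / (f * f)

for n in (4, 8, 9):
    print(n, decoded_field_length(n))
```

Under these assumptions, lengths 8 and 9 both fall into the same byte bucket and decode to 10.24, while a length of 4 survives unchanged. avgFieldLength is unaffected because it is derived from exact index statistics and not from the norms. If this reading is right, the behaviour is a consequence of the lossy norm encoding, not a bug.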
>
> Is there some "magic" applied to the value of field length that goes beyond the standard
> BM25 score formula?
>
> If so, what is the idea behind this modification? If not, is this a Lucene / Solr bug?
>
> Best regards,
> Sascha
>
>
>
>
>

-- 
Sascha Szott :: KOBV/ZIB :: +49 30 84185-457
