lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: BlendedTermQuery causing negative IDF?
Date Tue, 19 Apr 2016 14:22:11 GMT
Hi again,

For those who are interested, I uploaded a graph [0] of BM25's term frequency component for
some common and content-bearing words.


[0] http://2.1m.yt/PgUEcZ.png

Ahmet




On Tuesday, April 19, 2016 5:16 PM, Ahmet Arslan <iorixxx@yahoo.com.INVALID> wrote:


Hi Markus,

This is a known property of BM25: it can produce negative scores for common terms.
Most term-weighting models were developed for indices from which stop words have been removed,
so they have trouble scoring common terms.
By the way, the DFI model does a decent job of handling common terms.
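
As a rough illustration of the point about common terms, here is a minimal sketch of the textbook Robertson/Sparck Jones IDF, ln((N - n + 0.5) / (n + 0.5)). It is not necessarily the exact formula a given Lucene version uses; the class name, method name, and collection size below are hypothetical and chosen only for the example.

```java
// Sketch only: textbook BM25 IDF, not taken from the Lucene source.
final class ClassicBm25Idf {

    // ln((N - n + 0.5) / (n + 0.5)), where N = numDocs and n = docFreq.
    static double classicIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long numDocs = 1000; // hypothetical collection size
        // A rare term, a term in exactly half the documents, and a very common term.
        for (long docFreq : new long[] {10, 500, 900}) {
            System.out.printf("docFreq=%d idf=%.4f%n", docFreq, classicIdf(docFreq, numDocs));
        }
    }
}
```

With this form the weight is positive for the rare term, zero when the term occurs in exactly half of the documents, and negative once it occurs in more than half.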

Ahmet



On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
Hello,

I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity, and
I have a very simple unit test to see whether anything works at all. To my surprise,
one of the results has a negative score, caused by a negative IDF because docFreq is higher
than docCount for that term on that field. Here are the test documents:

    assertU(adoc("id", "1", "text", "rare term"));
    assertU(adoc("id", "2", "text_nl", "less rare term"));
    assertU(adoc("id", "3", "text_nl", "rarest term"));
    assertU(commit());

My query parser creates the following Lucene query: BlendedTermQuery(Blended(text:rare text:term
text_nl:rare text_nl:term)), which looks fine to me. But this is what I get back when
issuing that query against the above set of documents; the third result (document id 1) is the
one with a negative score.

<result name="response" numFound="3" start="0" maxScore="0.1805489">
  <doc>
    <str name="id">3</str>
    <float name="score">0.1805489</float></doc>
  <doc>
    <str name="id">2</str>
    <float name="score">0.14785346</float></doc>
  <doc>
    <str name="id">1</str>
    <float name="score">-0.004004207</float></doc>
</result>
<lst name="debug">
  <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
  <str name="querystring">{!blended fl=text,text_nl}rare term</str>
  <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term text_nl:rare
text_nl:term))</str>
  <str name="parsedquery_toString">Blended(text:rare text:term text_nl:rare text_nl:term)</str>
  <lst name="explain">
    <str name="3">
0.1805489 = max plus 0.01 times others of:
  0.1805489 = weight(text_nl:term in 2) [], result of:
    0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.9902773 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        2.56 = fieldLength
</str>
    <str name="2">
0.14785345 = max plus 0.01 times others of:
  0.14638956 = weight(text_nl:rare in 1) [], result of:
    0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.8029196 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        4.0 = fieldLength
  0.14638956 = weight(text_nl:term in 1) [], result of:
    0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.8029196 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        4.0 = fieldLength
</str>
    <str name="1">
-0.004004207 = max plus 0.01 times others of:
  -0.20021036 = weight(text:rare in 0) [], result of:
    -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      -0.22314355 = idf(docFreq=2, docCount=1)
      0.89722675 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.0 = avgFieldLength
        2.56 = fieldLength
  -0.20021036 = weight(text:term in 0) [], result of:
    -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      -0.22314355 = idf(docFreq=2, docCount=1)
      0.89722675 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.0 = avgFieldLength
        2.56 = fieldLength
</str>
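
The idf and tfNorm values in the explain output above can be reproduced with a few lines of arithmetic. The sketch below uses formulas that are consistent with those numbers and are assumed to mirror what BM25Similarity computes; the class and method names are hypothetical, and nothing here calls Lucene itself.

```java
// Sketch only: re-derives the explain numbers by hand, assuming these formulas.
final class ExplainCheck {

    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength));
    }

    public static void main(String[] args) {
        double k1 = 1.2;  // parameter k1 from the explain output
        double b = 0.75;  // parameter b from the explain output

        // text_nl:term in doc 2: idf(docFreq=2, docCount=2), fieldLength 2.56, avgFieldLength 2.5
        System.out.println(idf(2, 2) * tfNorm(1.0, k1, b, 2.56, 2.5));

        // text:rare in doc 0: idf(docFreq=2, docCount=1), fieldLength 2.56, avgFieldLength 2.0
        System.out.println(idf(2, 1) * tfNorm(1.0, k1, b, 2.56, 2.0));
    }
}
```

Under these formulas, idf(docFreq=2, docCount=1) is ln(0.8), which is negative: once docFreq exceeds docCount, the argument of the logarithm drops below 1 and the whole term weight changes sign.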

What am I doing wrong? Or did I catch a bug?

Thanks,
Markus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

