lucene-java-user mailing list archives

From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject Re: BlendedTermQuery causing negative IDF?
Date Tue, 19 Apr 2016 14:32:59 GMT
Lucene's BM25 avoids negative scores for this by adding 1 inside the log
term of BM25's IDF.

Compare this:
https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71

to the Wikipedia article's BM25 IDF
https://en.wikipedia.org/wiki/Okapi_BM25
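
To make the comparison concrete, here is a minimal sketch of the two IDF
forms in plain Java. The class and method names are hypothetical and none of
this is Lucene API; the formulas are assumed to be the classic Okapi form
without the added 1 and the form at the BM25Similarity link above.

```java
// Sketch only: names are illustrative, this is not Lucene code.
final class Bm25IdfSketch {

    // Classic Okapi form without the added 1:
    // ln((docCount - docFreq + 0.5) / (docFreq + 0.5)).
    // Negative once docFreq exceeds half of docCount.
    static double classicIdf(long docFreq, long docCount) {
        return Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Variant with 1 added inside the log:
    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    // Non-negative as long as docFreq <= docCount.
    static double luceneIdf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // The two (docFreq, docCount) pairs from the explain output in this thread.
        long[][] stats = { {2, 2}, {2, 1} };
        for (long[] s : stats) {
            System.out.printf("docFreq=%d docCount=%d classic=%f lucene=%f%n",
                    s[0], s[1], classicIdf(s[0], s[1]), luceneIdf(s[0], s[1]));
        }
    }
}
```

With docFreq=2, docCount=2 the second form stays positive where the classic
form is negative. With docFreq=2, docCount=1, as in the explain output quoted
below, the argument of the log is 0.8, so even the second form goes negative.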

Markus, another thing to add is that when Elasticsearch uses
BlendedTermQuery, it relies on a lot of invariants that must hold. For
example, the fields must share the same analyzer. You may need to research
what else happens in Elasticsearch outside BlendedTermQuery to get this
behavior to work.

Another testing philosophy point: when I do this kind of work, I like to
isolate the Lucene behavior from the Solr behavior. I would suggest
creating a Lucene unit test to validate your assumptions around
BlendedTermQuery, just to help isolate the issues. Here are Lucene's tests
for BlendedTermQuery as a basis:

https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java
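
Before writing such a test, it can help to work out by hand the numbers it
should assert. The sketch below recomputes the per-term scores from the
explain output quoted below as idf times tfNorm. It is plain Java with
hypothetical names, not Lucene code, and the formulas are assumed from the
explain output and from BM25Similarity at the commit linked above.

```java
// Sketch only: names are illustrative, this is not Lucene code.
final class Bm25ScoreSketch {

    // IDF with 1 added inside the log, as in the first link above.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Term-frequency normalization using the parameters named in the
    // explain output: k1, b, avgFieldLength, fieldLength.
    static double tfNorm(double freq, double k1, double b,
                         double avgFieldLength, double fieldLength) {
        double lengthNorm = 1 - b + b * fieldLength / avgFieldLength;
        return (freq * (k1 + 1)) / (freq + k1 * lengthNorm);
    }

    // Per-term score: the product shown in the explain output.
    static double score(long docFreq, long docCount, double freq,
                        double avgFieldLength, double fieldLength) {
        double k1 = 1.2;   // "parameter k1" in the explain output
        double b = 0.75;   // "parameter b" in the explain output
        return idf(docFreq, docCount) * tfNorm(freq, k1, b, avgFieldLength, fieldLength);
    }

    public static void main(String[] args) {
        // text_nl:term in doc id 3 (docFreq=2, docCount=2)
        System.out.println(score(2, 2, 1.0, 2.5, 2.56));
        // text:rare in doc id 1 (docFreq=2, docCount=1)
        System.out.println(score(2, 1, 1.0, 2.0, 2.56));
    }
}
```

The second call reproduces the negative weight: the blended docFreq of 2 is
larger than the docCount of 1 for the text field, so the idf factor is
negative and the product follows.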

On Tue, Apr 19, 2016 at 10:16 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
wrote:

>
>
> Hi Markus,
>
> It is a known property of BM25. It produces negative scores for common
> terms.
> Most of the term-weighting models are developed for indices in which stop
> words are eliminated.
> Therefore, most of the term-weighting models have problems scoring common
> terms.
> By the way, the DFI model does a decent job of handling common terms.
>
> Ahmet
>
>
>
> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <
> markus.jelsma@openindex.io> wrote:
> Hello,
>
> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using
> BM25 similarity, and I have a very simple unit test to see if anything is
> working at all. To my surprise, one of the results has a negative
> score, caused by a negative IDF because docFreq is higher than docCount for
> that term on that field. Here are the test documents:
>
>     assertU(adoc("id", "1", "text", "rare term"));
>     assertU(adoc("id", "2", "text_nl", "less rare term"));
>     assertU(adoc("id", "3", "text_nl", "rarest term"));
>     assertU(commit());
>
> My query parser creates the following Lucene query:
> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
> which looks fine to me. But this is what I get back when issuing
> that query on the above set of documents; the third document is the one
> with a negative score.
>
> <result name="response" numFound="3" start="0" maxScore="0.1805489">
>   <doc>
>     <str name="id">3</str>
>     <float name="score">0.1805489</float></doc>
>   <doc>
>     <str name="id">2</str>
>     <float name="score">0.14785346</float></doc>
>   <doc>
>     <str name="id">1</str>
>     <float name="score">-0.004004207</float></doc>
> </result>
> <lst name="debug">
>   <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="querystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term
> text_nl:rare text_nl:term))</str>
>   <str name="parsedquery_toString">Blended(text:rare text:term
> text_nl:rare text_nl:term)</str>
>   <lst name="explain">
>     <str name="3">
> 0.1805489 = max plus 0.01 times others of:
>   0.1805489 = weight(text_nl:term in 2) [], result of:
>     0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.9902773 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         2.56 = fieldLength
> </str>
>     <str name="2">
> 0.14785345 = max plus 0.01 times others of:
>   0.14638956 = weight(text_nl:rare in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
>   0.14638956 = weight(text_nl:term in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
> </str>
>     <str name="1">
> -0.004004207 = max plus 0.01 times others of:
>   -0.20021036 = weight(text:rare in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
>   -0.20021036 = weight(text:term in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
> </str>
>
> What am I doing wrong? Or did I catch a bug?
>
> Thanks,
> Markus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
