lucene-java-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: BlendedTermQuery causing negative IDF?
Date Tue, 19 Apr 2016 14:29:53 GMT
Hello Ahmet,

Before the unit test with the BlendedTermQuery I also run a sanity check using a simple
Boolean query via LuceneQParser. The query is analogous to the BlendedTermQuery, (text_nl:rare
text_nl:term) (text:rare text:term), and does not produce negative scores because the docFreq
doesn't exceed docCount.
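
For reference, the idf and tfNorm values in the explain output quoted below can be reproduced
with a small standalone sketch. It assumes the formulas that BM25Similarity uses in Lucene 6.0,
idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and
tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)).
The class and method names are made up for illustration and are not part of Lucene, and the
sketch works in double precision, so the last digits can differ from Lucene's float values.

```java
// Hypothetical helper, not Lucene code: re-computes the numbers from the explain output.
final class Bm25IdfCheck {

    // IDF as assumed above; it turns negative as soon as docFreq exceeds docCount.
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    // Term-frequency normalisation as assumed above.
    public static double tfNorm(double freq, double k1, double b,
                                double avgFieldLength, double fieldLength) {
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength));
    }

    public static void main(String[] args) {
        // text_nl: docFreq=2, docCount=2 gives log(1.2), about 0.1823
        System.out.println("idf(2, 2) = " + idf(2, 2));
        // text: docFreq=2, docCount=1 gives log(0.8), about -0.2231
        System.out.println("idf(2, 1) = " + idf(2, 1));
        // Document 3: k1=1.2, b=0.75, avgFieldLength=2.5, fieldLength=2.56
        System.out.println("score(doc 3) = " + idf(2, 2) * tfNorm(1.0, 1.2, 0.75, 2.5, 2.56));
    }
}
```

With docFreq=2 and docCount=1 the argument of the logarithm is 1 + (-0.5 / 2.5) = 0.8, which is
below 1, so the IDF is negative. With docFreq <= docCount the argument never drops below 1.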

I'd like to try DFISimilarity and ClassicSimilarity as well, but for some reason the unit
tests do not accept the similarity defined in the test's schema.xml?!

Thanks!
Markus

 
 
-----Original message-----
> From:Ahmet Arslan <iorixxx@yahoo.com.INVALID>
> Sent: Tuesday 19th April 2016 16:17
> To: java-user@lucene.apache.org
> Subject: Re: BlendedTermQuery causing negative IDF?
> 
> 
> 
> Hi Markus,
> 
> It is a known property of BM25. It produces negative scores for common terms.
> Most of the term-weighting models are developed for indices in which stop words are eliminated.
> Therefore, most of the term-weighting models have problems scoring common terms.
> By the way, DFI model does a decent job when handling common terms.
> 
> Ahmet
> 
> 
> 
> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> Hello,
> 
> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity,
> and I have a very simple unit test to see if anything is working at all. But to my surprise,
> one of the results has a negative score, caused by a negative IDF because docFreq is higher
> than docCount for that term on that field. Here are the test documents:
> 
>     assertU(adoc("id", "1", "text", "rare term"));
>     assertU(adoc("id", "2", "text_nl", "less rare term"));
>     assertU(adoc("id", "3", "text_nl", "rarest term"));
>     assertU(commit());
> 
> My query parser creates the following Lucene query:
> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term)), which looks fine
> to me. But this is what I get back when issuing that query on the above set of documents.
> The third result (id 1) is the one with a negative score.
> 
> <result name="response" numFound="3" start="0" maxScore="0.1805489">
>   <doc>
>     <str name="id">3</str>
>     <float name="score">0.1805489</float></doc>
>   <doc>
>     <str name="id">2</str>
>     <float name="score">0.14785346</float></doc>
>   <doc>
>     <str name="id">1</str>
>     <float name="score">-0.004004207</float></doc>
> </result>
> <lst name="debug">
>   <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="querystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))</str>
>   <str name="parsedquery_toString">Blended(text:rare text:term text_nl:rare text_nl:term)</str>
>   <lst name="explain">
>     <str name="3">
> 0.1805489 = max plus 0.01 times others of:
>   0.1805489 = weight(text_nl:term in 2) [], result of:
>     0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.9902773 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         2.56 = fieldLength
> </str>
>     <str name="2">
> 0.14785345 = max plus 0.01 times others of:
>   0.14638956 = weight(text_nl:rare in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
>   0.14638956 = weight(text_nl:term in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
> </str>
>     <str name="1">
> -0.004004207 = max plus 0.01 times others of:
>   -0.20021036 = weight(text:rare in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
>   -0.20021036 = weight(text:term in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
> </str>
> 
> What am i doing wrong? Or did i catch a bug?
> 
> Thanks,
> Markus
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org