lucene-dev mailing list archives

From "Shayan Tabrizi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-7480) Wrong Formula in LMDirichletSimilarity
Date Thu, 06 Oct 2016 21:01:20 GMT
Shayan Tabrizi created LUCENE-7480:
--------------------------------------

             Summary: Wrong Formula in LMDirichletSimilarity
                 Key: LUCENE-7480
                 URL: https://issues.apache.org/jira/browse/LUCENE-7480
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Shayan Tabrizi


It seems that the "score" method of LMDirichletSimilarity is only invoked when the term occurs
in the document. Otherwise, at line 389 of BooleanWeight (Lucene 6.2.0), subScorer is null, so
the clause is never added to the optional list and is therefore never scored.

However, the original LM formula (http://www.stat.uchicago.edu/~lafferty/pdf/smooth-tois.pdf,
formula 6) contains the term "n log a_d", where n is the number of query terms. Therefore, a
"log a_d" must be added to the final score even for query terms that are not present in the
document.
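
For reference, the formula can be written out as below. This is a sketch from memory of the
cited paper's notation (p_s is the smoothed probability of a seen term, p(.|C) the collection
model, c(w;d) the count of w in d, |d| the document length, mu the Dirichlet parameter), so
please check it against the paper before relying on it.

```latex
% General smoothed query likelihood, for a query q = q_1 ... q_n
\log p(q \mid d) = \sum_{i:\, c(q_i; d) > 0} \log \frac{p_s(q_i \mid d)}{\alpha_d \, p(q_i \mid C)}
                   + n \log \alpha_d + \sum_{i=1}^{n} \log p(q_i \mid C)

% Dirichlet smoothing
p_s(w \mid d) = \frac{c(w; d) + \mu \, p(w \mid C)}{|d| + \mu}, \qquad
\alpha_d = \frac{\mu}{|d| + \mu}

% Hence, for a term that occurs in d
\log \frac{p_s(w \mid d)}{\alpha_d \, p(w \mid C)} = \log\!\left(1 + \frac{c(w; d)}{\mu \, p(w \mid C)}\right)
```

The last sum of the first equation does not depend on the document and can be dropped for
ranking, but "n log a_d" does depend on the document length and counts all n query terms.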

But LMDirichletSimilarity adds "log a_d" inside the "score" method, so it only reaches the
final score for the query terms that are present in the document.
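
To make the difference concrete, here is a minimal sketch in Python (not Lucene code). It
compares a score that adds "log a_d" only for matching terms with one that adds "n log a_d".
The function names, the value mu = 2000, the collection probabilities and the two toy
documents are all assumptions made up for illustration, and boosts and any clamping of
negative scores are ignored.

```python
import math

MU = 2000.0  # Dirichlet smoothing parameter (assumed value for this sketch)


def log_alpha(doc_len, mu=MU):
    """log a_d = log(mu / (|d| + mu)); always <= 0, more negative for long docs."""
    return math.log(mu / (doc_len + mu))


def score_matching_only(query, doc_tf, doc_len, coll_prob, mu=MU):
    """Adds log a_d once per query term that occurs in the document (behaviour described in the report)."""
    total = 0.0
    for term in query:
        tf = doc_tf.get(term, 0)
        if tf > 0:
            total += math.log(1.0 + tf / (mu * coll_prob[term])) + log_alpha(doc_len, mu)
    return total


def score_full(query, doc_tf, doc_len, coll_prob, mu=MU):
    """Adds n * log a_d for all n query terms, matching or not."""
    total = len(query) * log_alpha(doc_len, mu)
    for term in query:
        tf = doc_tf.get(term, 0)
        if tf > 0:
            total += math.log(1.0 + tf / (mu * coll_prob[term]))
    return total


# Hypothetical toy data: a two-term query where each document matches only "a".
query = ["a", "b"]
coll_prob = {"a": 0.001, "b": 0.001}
doc_long = ({"a": 20}, 8000)   # (term frequencies, document length)
doc_short = ({"a": 1}, 500)

for name, (tf, length) in (("long", doc_long), ("short", doc_short)):
    print(name,
          round(score_matching_only(query, tf, length, coll_prob), 4),
          round(score_full(query, tf, length, coll_prob), 4))
```

With these made-up numbers the matching-only score ranks the long document first, while the
score with "n log a_d" ranks the short document first, because the long document is penalised
once more for the query term it does not contain.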

This can worsen the retrieval results compared to the correct formula. I tried to fix this
myself, but I was not successful because of the many "final" methods and classes.
Please check the problem and fix it if you agree, and please also tell me how I can work
around it before a new release is published.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

