I am implementing a language modelling (type) similarity function, and am
using the LMDirichletSimilarity class (and its helper classes) as a
template. However, it seems the LMDirichletSimilarity.class implementation
is not the same as that presented in "A Study of Smoothing Methods for
Language Models Applied to Information Retrieval" by Zhai and Lafferty.
The score method in LMDirichletSimilarity.class for matching terms is
implemented as follows:
    score = (float) (Math.log(1 + freq / (mu * ((LMStats)
        stats).getCollectionProbability())) + Math.log(mu / (docLen + mu)))
In particular, the score method in that class applies the normalisation
factor (i.e. the Math.log(mu / (docLen + mu)) part) only to matching
terms. It should actually apply this normalisation for every term in the
query, regardless of whether the term occurs in the document. The
Math.log(mu / (docLen + mu)) should really be removed from the per-term
score, and the following document-specific score should be added to the
document score after the term-scoring part (unless I am missing some
background scoring that is going on in Lucene):
+ queryLen * Math.log(mu / (docLen + mu))
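To make the difference concrete, here is a minimal plain-Java sketch (no
Lucene classes involved) that compares the direct Dirichlet-smoothed query
log-likelihood with the rank-equivalent form described above. The class
name, method names and the map-based inputs are made up for illustration;
the only assumption is that p(w|C) is known for every query term.

```java
import java.util.Map;

// Illustration only: a hypothetical helper, not part of Lucene.
class DirichletLmSketch {

    // Direct form: sum over ALL query terms of
    // log( (tf + mu * p(w|C)) / (docLen + mu) ).
    static double directLogLikelihood(String[] query, Map<String, Integer> docTf,
                                      int docLen, Map<String, Double> collProb,
                                      double mu) {
        double sum = 0.0;
        for (String w : query) {
            int tf = docTf.getOrDefault(w, 0);
            sum += Math.log((tf + mu * collProb.get(w)) / (docLen + mu));
        }
        return sum;
    }

    // Rank-equivalent form: per-term part for matching terms only, plus one
    // document-specific term covering every query term.
    static double rankEquivalentScore(String[] query, Map<String, Integer> docTf,
                                      int docLen, Map<String, Double> collProb,
                                      double mu) {
        double sum = 0.0;
        for (String w : query) {
            int tf = docTf.getOrDefault(w, 0);
            if (tf > 0) {
                sum += Math.log(1 + tf / (mu * collProb.get(w)));
            }
        }
        return sum + query.length * Math.log(mu / (docLen + mu));
    }

    // Document-independent remainder: sum of log p(w|C) over the query terms.
    // It is the same for every document, so it does not affect the ranking.
    static double collectionConstant(String[] query, Map<String, Double> collProb) {
        double sum = 0.0;
        for (String w : query) {
            sum += Math.log(collProb.get(w));
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] query = {"a", "b", "c"};
        Map<String, Integer> tf = Map.of("a", 2, "b", 1); // "c" does not match
        Map<String, Double> p = Map.of("a", 0.01, "b", 0.002, "c", 0.0005);
        double direct = directLogLikelihood(query, tf, 100, p, 2000.0);
        double equiv = rankEquivalentScore(query, tf, 100, p, 2000.0)
                + collectionConstant(query, p);
        System.out.println(direct + " vs " + equiv);
    }
}
```

The two forms differ only by the collection constant, which is why the
normalisation has to be counted once per query term rather than once per
matching term.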
Therefore, my question is as follows:
Where in Lucene can I add a document-specific factor just prior to sorting
the final document scores? I want this to be calculated and tuneable at
query time (not index time).
The boosting features of Lucene seem to be inflexible for this purpose, as
they assume that the boosting factor is multiplied into the score, whereas
I need to add a term.
I could run the initial query and then rescore the documents in the
TopDocs by adding the factor, but it seems like there has to be a more
efficient way to do this.
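For reference, the rescoring fallback I have in mind looks roughly like
the sketch below. It is plain Java and deliberately avoids the Lucene API:
the Hit class is a hypothetical stand-in for a search hit plus its document
length, and it assumes the first-pass score contains only the per-term part
(i.e. the Math.log(mu / (docLen + mu)) term has already been removed).

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustration only: none of these types are Lucene classes.
class LengthNormRescoreSketch {

    // Hypothetical stand-in for one first-pass hit and its document length.
    static final class Hit {
        final int docId;
        final double score;
        final int docLen;

        Hit(int docId, double score, int docLen) {
            this.docId = docId;
            this.score = score;
            this.docLen = docLen;
        }
    }

    // Adds queryLen * log(mu / (docLen + mu)) to each hit and re-sorts the
    // hits by the adjusted score, highest first. The input array is not
    // modified.
    static Hit[] rescore(Hit[] hits, int queryLen, double mu) {
        Hit[] out = new Hit[hits.length];
        for (int i = 0; i < hits.length; i++) {
            Hit h = hits[i];
            double norm = queryLen * Math.log(mu / (h.docLen + mu));
            out[i] = new Hit(h.docId, h.score + norm, h.docLen);
        }
        Arrays.sort(out, Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        Hit[] firstPass = {
            new Hit(1, 5.0, 1000), // long document, higher first-pass score
            new Hit(2, 4.8, 10)    // short document, slightly lower score
        };
        for (Hit h : rescore(firstPass, 3, 2000.0)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

Besides the extra pass, this only reorders the documents that made it into
the first-pass result list, so a document that would enter the top results
after the adjustment can still be missed.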
As this is one of the main formulas in information retrieval, it would be
nice if it was implemented correctly.
Any help appreciated...
