Sorry, I was wrong on my solution for #2. I'm linking some equations here
<http://mathb.in/34502?key=b2b24cfc50ee4983d8a2a0da09bab2686e8f2592> that
should explain a consistent approach. Leaving LMDirichletSimilarity as is
skews the "additive queryNorm" factor. LMDirichletSimilarity should have
only the following in its .score() function:
term_score_proposed = Math.log(1 + freq /
(mu * collectionProbability))
At this point, the score is rank-equivalent with the correct score.
However, to get correct probabilities for other purposes (e.g., weighting
pseudo-relevance query expansion), the final score would need to add in:
query_score = sum_matched(term_score_proposed) +
sum_all(Math.log(mu*collectionProbability/(docLen+mu)))
where sum_matched runs only over the query terms that occur in the
document, and sum_all runs over every term in the query.
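
As a sanity check, here is a minimal sketch in plain Java (no Lucene classes; the class and method names are hypothetical, and the term statistics are passed in as plain arrays). It only checks that the two sums above add up to the Zhai and Lafferty form summed over every query term; it does not show how to wire this into CustomScoreProvider.

```java
/** Hypothetical helper for checking the algebra; not part of Lucene. */
final class DirichletScoreSketch {

    private DirichletScoreSketch() {}

    /** Proposed per-term score: log(1 + freq / (mu * collectionProbability)). */
    static double termScoreProposed(double freq, double mu, double collectionProbability) {
        return Math.log(1.0 + freq / (mu * collectionProbability));
    }

    /** Original form: log((freq + mu * collectionProbability) / (docLen + mu)). */
    static double termScoreOfficial(double freq, double docLen, double mu,
                                    double collectionProbability) {
        return Math.log((freq + mu * collectionProbability) / (docLen + mu));
    }

    /**
     * query_score = sum_matched(term_score_proposed)
     *             + sum_all(log(mu * collectionProbability / (docLen + mu))).
     * freqs[i] is the in-document frequency of query term i (0 if unmatched).
     */
    static double queryScore(double[] freqs, double[] collectionProbabilities,
                             double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                // Only matched terms contribute the proposed term score.
                score += termScoreProposed(freqs[i], mu, collectionProbabilities[i]);
            }
            // Every query term, matched or not, contributes this part.
            score += Math.log(mu * collectionProbabilities[i] / (docLen + mu));
        }
        return score;
    }

    /** Reference value: the original form summed over every query term. */
    static double queryScoreOfficial(double[] freqs, double[] collectionProbabilities,
                                     double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            score += termScoreOfficial(freqs[i], docLen, mu, collectionProbabilities[i]);
        }
        return score;
    }
}
```

For an unmatched term, freq is 0, so term_score_proposed is log(1) = 0 and only the second sum contributes, which is why skipping .score() for unmatched terms does no harm in this form.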
Any help on how to implement this, especially getting the
collectionProbabilities into CustomScoreProvider, would be appreciated.
stephen
On Fri, May 1, 2015 at 11:16 AM, Stephen Wu <stephen@trapit.com> wrote:
> I am having trouble getting collection probabilities for a term to show up
> in a CustomScoreQuery/CustomScoreProvider. Basically, I am trying to add a
> per-document weight that amounts to the sum (for each term in the query) of
> Math.log(collectionProbability). Can anyone help with this?
>
> Or feel free to suggest a better way to do this. Here's a description...
>
> 
> LMDirichletSimilarity is not consistent with the original equations, as
> many have noted. Here's how it's different under two approaches:
>
> 1. *Swap in LMDirichletSimilarity* in place of some other similarity, but
> modify the scoring function. Ignoring the boost, it is currently
> implemented as:
> term_score_current = Math.log(1 + freq /
> (mu * collectionProbability)) +
> Math.log(mu / (docLen + mu))
>
> If you do this, there are two problems. The first problem is that the
> score is off by a factor of Math.log(collectionProbability). Do the math
> <http://en.wikipedia.org/wiki/List_of_logarithmic_identities>: if you add
> that in, you will get something equal to the original formulation
> (e.g., in Zhai and Lafferty 2001). For reference, that looks like:
> term_score_official = Math.log( (freq+mu*collectionProbability) /
> (docLen+mu) )
>
> If you add that factor, though, the second problem arises. That
> Math.log(collectionProbability) factor does not get added for terms that
> don't MATCH with a document because .score() doesn't get called if there's
> no MATCH. This is basically the problem that Ronan Cummins wrote about a
> few weeks ago.
>
> 2. *Leave LMDirichletSimilarity as it is* but *add a factor* to every
> final score that is returned. (Note: you'd also need to remove the
> nonnegative score restriction in LMDirichletSimilarity.) This would be
> the sum of the log collection probabilities for each term:
> query_score = sum(term_score_current) +
> sum(Math.log(collectionProbability))
>
> As some have mentioned, this is basically an additive version of a
> queryNorm. It seems like the right way to do this is to wrap each query in
> a modified CustomScoreQuery accessing a CustomScoreProvider, which would
> then add that "constant" factor across all documents. However, this
> "constant" factor needs to be computed from statistics; how can this be
> done? Those statistics are available in LMDirichletSimilarity, but it is
> less clear how to find those statistics directly from a Query object.
>
> stephen
