lucene-java-user mailing list archives

From Stephen Wu <>
Subject Re: Stats in CustomScoreProvider + (in)correctness of LMDirichletSimilarity
Date Fri, 01 May 2015 22:10:14 GMT
Sorry, I was wrong about my solution for #2 -- linking some equations here
<> that
should explain a consistent approach.  Leaving LMDirichletSimilarity as-is
skews the "additive queryNorm" factor.  LMDirichletSimilarity should have
only the following in its .score() function:
    term_score_proposed = Math.log(1 + freq /
        (mu * collectionProbability))
At this point, the score is rank-equivalent with the correct score.
However, to get correct probabilities for other purposes (e.g., weighting
pseudo-relevance query expansion), the final score would need to add in:
    query_score = sum_matched(term_score_proposed) +
where there is a difference between matched terms and all terms.
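
A minimal numeric sketch of the matched-versus-all-terms distinction, assuming
the per-term formulas quoted in this thread.  The split used here, matched terms
contributing term_score_proposed and every query term contributing
Math.log(mu * collectionProbability / (docLen + mu)), is what follows
algebraically from term_score_official; it is not necessarily the exact
expression intended above.  The class and method names are hypothetical.

```java
/** Sketch: splits the Dirichlet-smoothed query log-likelihood into matched and all-terms parts. */
final class DirichletDecomposition {

    /** Per-term score proposed in this message: log(1 + freq / (mu * p)). */
    static double termScoreProposed(double freq, double mu, double p) {
        return Math.log(1.0 + freq / (mu * p));
    }

    /** Per-term score in the Zhai and Lafferty form: log((freq + mu * p) / (docLen + mu)). */
    static double termScoreOfficial(double freq, double docLen, double mu, double p) {
        return Math.log((freq + mu * p) / (docLen + mu));
    }

    /** Official score summed over ALL query terms, whether or not they occur in the document. */
    static double queryScoreOfficial(double[] freqs, double[] probs, double docLen, double mu) {
        double sum = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            sum += termScoreOfficial(freqs[i], docLen, mu, probs[i]);
        }
        return sum;
    }

    /** Same quantity, written as sum over matched terms plus a remainder over all terms. */
    static double queryScoreDecomposed(double[] freqs, double[] probs, double docLen, double mu) {
        double matched = 0.0;
        double allTerms = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                // Only matched terms would reach a Similarity's score() call.
                matched += termScoreProposed(freqs[i], mu, probs[i]);
            }
            // Every query term contributes this part, matched or not.
            allTerms += Math.log(mu * probs[i] / (docLen + mu));
        }
        return matched + allTerms;
    }

    public static void main(String[] args) {
        double[] freqs = {3, 0, 1};              // second term does not occur in the document
        double[] probs = {0.01, 0.002, 0.0005};  // made-up collection probabilities
        System.out.println(queryScoreOfficial(freqs, probs, 120, 2000));
        System.out.println(queryScoreDecomposed(freqs, probs, 120, 2000));
    }
}
```

The two printed values should agree up to floating-point rounding.  The
remainder depends on docLen, so it is constant across documents only in its
Math.log(collectionProbability) part.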

Any help on how to implement this, especially getting the
collectionProbabilities into CustomScoreProvider, would be appreciated.
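
One possible starting point, sketched under assumptions: compute the
per-query sum of log collection probabilities from raw counts before
constructing the provider, then hand the resulting constant to it.  The counts
below are plain inputs; in Lucene they would presumably come from
IndexReader.totalTermFreq(Term) and IndexReader.getSumTotalTermFreq(String),
and the add-one form is assumed to mirror the default collection model used by
LMSimilarity.  The class and method names are hypothetical, and the wiring into
CustomScoreProvider is not shown.

```java
/** Sketch: query-level sum of log collection probabilities from raw term counts. */
final class CollectionStats {

    /**
     * Collection probability of one term.
     * totalTermFreq: occurrences of the term in the field across the index.
     * sumTotalTermFreq: total number of tokens in the field across the index.
     * The +1 on both sides is an assumed smoothing that keeps unseen terms above zero.
     */
    static double collectionProbability(long totalTermFreq, long sumTotalTermFreq) {
        return (totalTermFreq + 1.0) / (sumTotalTermFreq + 1.0);
    }

    /** Sum of Math.log(collectionProbability) over every query term, matched or not. */
    static double sumLogCollectionProbability(long[] totalTermFreqs, long sumTotalTermFreq) {
        double sum = 0.0;
        for (long ttf : totalTermFreqs) {
            sum += Math.log(collectionProbability(ttf, sumTotalTermFreq));
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] queryTermCounts = {9, 0};  // made-up counts for a two-term query
        System.out.println(sumLogCollectionProbability(queryTermCounts, 99));
    }
}
```

Because this value depends only on the query and the index, it can be computed
once per query rather than once per document.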


On Fri, May 1, 2015 at 11:16 AM, Stephen Wu <> wrote:

> I am having trouble getting collection probabilities for a term to show up
> in a CustomScoreQuery/CustomScoreProvider.  Basically, I am trying to add a
> per-document weight that amounts to the sum (for each term in the query) of
> Math.log(collectionProbability).  Can anyone help with this?
> Or feel free to suggest a better way to do this.  Here's a description...
> -----
> LMDirichletSimilarity is not consistent with the original equations, as
> many have noted.  Here's how it differs under two approaches:
> 1. *Swap in LMDirichletSimilarity* in place of some other similarity, but
> modify the scoring function.  Ignoring the boost, it is currently
> implemented as:
>     term_score_current = Math.log(1 + freq /
>         (mu * collectionProbability)) +
>         Math.log(mu / (docLen + mu))
> If you do this, there are two problems.  The first problem is that the
> score is off by a factor of Math.log(collectionProbability).  Do the math
> <>: if you add
> that in, you will get something equal to the original formulation
> (e.g., in Zhai and Lafferty 2001).  For reference, that looks like:
>     term_score_official = Math.log((freq + mu * collectionProbability) /
>         (docLen + mu))
> If you add that factor, though, the second problem arises.  That
> Math.log(collectionProbability) factor does not get added for terms that
> don't MATCH with a document because .score() doesn't get called if there's
> no MATCH.  This is basically the problem that Ronan Cummins wrote about a
> few weeks ago.
> 2. *Leave LMDirichletSimilarity as it is* but *add a factor* to every
> final score that is returned.  (Note: you'd also need to remove the
> non-negative score restriction in LMDirichletSimilarity.)  This would be
> the sum of the log collection probabilities for each term:
>     query_score = sum(term_score_current) +
>         sum(Math.log(collectionProbability))
> As some have mentioned, this is basically an additive version of a
> queryNorm.  It seems like the right way to do this is to wrap each query in
> a modified CustomScoreQuery accessing a CustomScoreProvider, which would
> then add that "constant" factor across all documents.  However, this
> "constant" factor needs to be computed from statistics; how can this be
> done?  Those statistics are available in LMDirichletSimilarity, but it is
> less clear how to find those statistics directly from a Query object.
> stephen
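
The claim in item 1 of the quoted message can be checked numerically.  The
sketch below assumes the two per-term formulas exactly as quoted (boost
ignored) and uses made-up values for freq, docLen, mu, and
collectionProbability; the class name is hypothetical.

```java
/** Sketch: term_score_current + log(collectionProbability) equals term_score_official. */
final class TermScoreCheck {

    /** Formula quoted as the current implementation, ignoring the boost. */
    static double termScoreCurrent(double freq, double docLen, double mu, double p) {
        return Math.log(1.0 + freq / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    /** Formula quoted as the original formulation (Zhai and Lafferty 2001). */
    static double termScoreOfficial(double freq, double docLen, double mu, double p) {
        return Math.log((freq + mu * p) / (docLen + mu));
    }

    public static void main(String[] args) {
        double freq = 3, docLen = 120, mu = 2000, p = 0.001;

        // Adding log(p) to the current score recovers the official score.
        double corrected = termScoreCurrent(freq, docLen, mu, p) + Math.log(p);
        System.out.println(corrected + " vs " + termScoreOfficial(freq, docLen, mu, p));

        // A term with freq == 0 still has a nonzero official score, which is
        // lost when score() is never called for non-matching terms.
        System.out.println("unmatched term: " + termScoreOfficial(0, docLen, mu, p));
    }
}
```

The second printed line illustrates the second problem from item 1: the
official score of an unmatched term is negative, not zero, so skipping it
changes the total.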
