On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
<markus.jelsma@openindex.io> wrote:
> Thanks Robert,
>
> The difference in scores is clear now so it shouldn't matter as queryNorm doesn't affect
> ranking but coord does. Can you explain why coord is left out now and why it is considered
> to skew results and why queryNorm skews results? And which specific new ranking algorithms
> they confuse, BM25F?
I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like coord because its
function (sqrt) grows very fast for a single term.
On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.
You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
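
If the plot doesn't load, here is a minimal Python sketch of the same
comparison. It assumes documentLength = averageDocumentLength, so that
lengthNorm reduces to the constant k = 1.2 as in the plot. The function
names are made up for illustration and are not Lucene APIs, and the BM25
form is the simplified one above, not the full formula.

```python
import math

# Assumption: documentLength == averageDocumentLength, so lengthNorm == k.
K = 1.2


def tf_default(tf):
    """TF component in the style of DefaultSimilarity: sqrt(tf), unbounded."""
    return math.sqrt(tf)


def tf_bm25(tf, length_norm=K):
    """Simplified BM25 TF component: tf / (tf + lengthNorm), always below 1."""
    return tf / (tf + length_norm)


# Print both curves side by side for a few term frequencies.
for tf in (1, 2, 5, 10, 50, 100):
    print(f"tf={tf:>3}  sqrt={tf_default(tf):7.3f}  bm25={tf_bm25(tf):.3f}")
```

The sqrt column keeps climbing as tf grows, while the BM25 column
approaches 1 and stays there.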
>
> Also, i would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity
> this might raise some further confusion down the line.
That's OK: I'd rather the very expert case (PerField scoring) be
trickier than have a trap for people who try to use any algorithm
other than TFIDFSimilarity.

lucidimagination.com
