lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: per-fieldtype similarity not working
Date Fri, 08 Jun 2012 13:05:00 GMT
Excellent!
Thanks

 
 
-----Original message-----
> From:Robert Muir <rcmuir@gmail.com>
> Sent: Fri 08-Jun-2012 13:06
> To: Markus Jelsma <markus.jelsma@openindex.io>
> Cc: solr-user@lucene.apache.org
> Subject: Re: per-fieldtype similarity not working
> 
> On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
> <markus.jelsma@openindex.io> wrote:
> > Thanks Robert,
> >
> > The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't
> > affect ranking but coord does. Can you explain why coord is left out now, why it is
> > considered to skew results, and why queryNorm skews results? And which specific new
> > ranking algorithms do they confuse, BM25F?
> 
> I think it's easiest to compare the two TF normalization functions.
> DefaultSimilarity really needs something like this because its
> function (sqrt) grows very fast for a single term.
> On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
> rather quickly for a single term, so when multiple terms are being
> scored, huge numbers of occurrences of a single term won't dominate
> the overall score.
> 
> You can see this visually here (give it a second to load, and imagine
> documentLength = averageDocumentLength and k=1.2):
> http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
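
The same comparison can be sketched numerically. This is only an illustration of the two curves in the plot above, not Lucene's actual scoring code: the function names are hypothetical, and the BM25 form is the simplified tf/(tf+k) from the message, assuming documentLength = averageDocumentLength and k=1.2.

```python
import math

K = 1.2  # assumed value, matching the k=1.2 used in the plot above


def tfidf_tf(tf):
    """TF component of DefaultSimilarity as described above: sqrt(tf)."""
    return math.sqrt(tf)


def bm25_tf(tf, k=K):
    """Simplified BM25 TF component: tf / (tf + k).

    Assumes documentLength == averageDocumentLength, so the length
    normalization term reduces to k.
    """
    return tf / (tf + k)


# Compare how each function grows as a single term occurs more often.
for tf in (1, 10, 100):
    print(tf, round(tfidf_tf(tf), 3), round(bm25_tf(tf), 3))
```

Under these assumptions the sqrt curve keeps growing (10.0 at tf=100), while the BM25 curve stays below 1 no matter how often the term occurs.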
> 
> >
> > Also, I would expect the default SchemaSimilarityFactory to behave the same as
> > DefaultSimilarity; this might raise some further confusion down the line.
> 
> That's OK: I'd rather the very expert case (per-field scoring) be
> trickier than have a trap for people who try to use any algorithm
> other than TFIDFSimilarity.
> 
> -- 
> lucidimagination.com
> 
