lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: relevance ranking and scoring
Date Tue, 23 Jan 2007 21:45:58 GMT
On 1/23/07, Andrew Nagy <andrew.nagy@villanova.edu> wrote:
> I have 2 questions about the SOLR relevancy system.

As far as scoring goes, it's pretty much stock Lucene with some other
stuff added on (like function queries).
http://lucene.apache.org/java/docs/scoring.html

> 1. Why is it when I search for an exact phrase of a title of a record I
> have it generally does not come up as the 1st record in the results?
>
> ex: title:(gone with the wind), the record comes up 3rd.  A record with
> the term "wind" as the first word in the title comes up 1st.
> ex: title:"gone with the wind", the record comes up 1st.

Well, you could do an exact or sloppy phrase match:
title:"gone with the wind"
But I get your point... you also want to match records with just "wind".

> Is this because the word "wind" is the only noun?

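This probably came about because of Lucene's length normalization
in the default similarity (the scorer knows nothing about parts of
speech).  It's 1/sqrt(num_terms_in_field)

So a document with a title of "wind" has a "norm" of 1.0, while a
document with 4 terms has a "norm" of 0.5 (or about 0.7 if stop words
like "with" and "the" are removed at index time, leaving 2 terms).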
Still, it seems like the coord factor (number of terms matching)
should have been more than enough to overcome the length
normalization.  What were the exact titles?  I assume you were not
using any type of index-time boosting?
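
To make the arithmetic concrete, here is a rough sketch of just those two factors. It is not the full Lucene score (tf, idf, query norm and the per-term summation are all left out), and the function names are made up for illustration. The coord factor is assumed to be matching terms divided by query terms, as in the default similarity.

```python
import math


def length_norm(num_terms_in_field: int) -> float:
    """Length normalization as described above: 1/sqrt(num_terms_in_field)."""
    return 1.0 / math.sqrt(num_terms_in_field)


def coord(matching_terms: int, query_terms: int) -> float:
    """Fraction of the query's terms that the document matches."""
    return matching_terms / query_terms


# Query: title:(gone with the wind), assuming all 4 terms are kept.
# Document A has the one-term title "wind"; document B has the full title.
doc_a = coord(1, 4) * length_norm(1)  # 0.25 * 1.0
doc_b = coord(4, 4) * length_norm(4)  # 1.0 * 0.5

print("wind:", doc_a)
print("gone with the wind:", doc_b)
```

On these two factors alone the full title still comes out ahead, which is why the observed ranking is surprising and the explain output is worth looking at.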

Things you can try:
- post the debugging output (including score explain) for the query
- try disabling length normalization for the title field, then remove
the entire index and re-index.
- try the dismax handler, which can generate sloppy phrase queries to
boost results containing all terms.
- try a different similarity implementation
(org.apache.lucene.misc.SweetSpotSimilarity from lucene)
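
For the first and third suggestions, the requests could look roughly like the sketch below. The host, port and /select path are assumptions based on the example Solr setup, and it assumes a request handler registered under the name "dismax" as in the example config; adjust both to your install. The parameters used are debugQuery, fl, qt, qf, pf and ps.

```python
from urllib.parse import urlencode

# Assumption: Solr running locally with the example setup's select URL.
BASE_URL = "http://localhost:8983/solr/select"


def solr_url(params: dict) -> str:
    """Build a GET URL for a Solr query from a dict of request parameters."""
    return BASE_URL + "?" + urlencode(params)


# Standard handler with debugging output (includes the score explanation).
debug_url = solr_url({
    "q": "title:(gone with the wind)",
    "fl": "*,score",
    "debugQuery": "on",
})

# Dismax handler: search the title field (qf) and also boost documents
# where the terms occur as a sloppy phrase in the title (pf, ps).
dismax_url = solr_url({
    "q": "gone with the wind",
    "qt": "dismax",
    "qf": "title",
    "pf": "title",
    "ps": "2",
})

print(debug_url)
print(dismax_url)
```

The ps value of 2 is only a placeholder for the phrase slop; tune it against your own data.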


> 2. The "score" that is associated with each value is quite odd, what
> does it represent.  I generally get results with the top record being
> somewhere around 3.0 or 2.0 and most records are below 1.

Scores aren't too comparable across different queries... the scores
are only meant to rank documents with respect to a single query.

-Yonik
