lucene-java-user mailing list archives

From markharw00d <>
Subject Re: Reducing number of poor results from large BooleanQueries
Date Sat, 10 Sep 2005 00:45:12 GMT
Isn't the trouble with introducing a scoring threshold based on raw 
scores that the Similarity scoring mechanism considers each 
document in isolation? At this stage we don't know whether the query is 
generally a good one or not (i.e. spelt correctly, and not a Googlewhack 
combination of rarely co-located terms). For example, we don't know whether, 
in general, the coord factor was very poor for all docs, in which case the 
score threshold applied to each doc should be relaxed.

A simple solution may be to delay thresholding until all results are in, 
to treat the top result as the "best you can get" for the given 
query (i.e. "100%"), and to set the threshold for accepting other results at 
something like 70% of the top score.
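
As a rough illustration of that rule, here is a minimal sketch in plain 
Java. It deliberately avoids the Lucene API and works on an array of 
scores collected after the search has finished; the class name 
RelativeThreshold, the method keep and the fraction parameter are 
hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: applies a "% of top score" cutoff
// once all scores for a query are known.
final class RelativeThreshold {

    /**
     * Returns the indices of the scores that reach at least
     * fraction * (highest score in the array).
     */
    static List<Integer> keep(double[] scores, double fraction) {
        List<Integer> kept = new ArrayList<>();
        if (scores.length == 0) {
            return kept;
        }
        // Find the top score; the input need not be sorted.
        double top = scores[0];
        for (double s : scores) {
            top = Math.max(top, s);
        }
        double cutoff = fraction * top;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= cutoff) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Calling RelativeThreshold.keep(scores, 0.70) would keep every result 
scoring at least 70% of the best hit, whatever the raw scale of the scores.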

This too has its faults: I've found it useful to consider examples of 
different queries and the distribution of their (normalized) scores.

* Googlewhack query (rare or misspelt terms: high idf, low coord; only 
one result with ALL terms)
[octupus jacuzzi tango]
1, 0.30, 0.30, 0.25, 0.25

* Very rare query (rare or misspelt terms: high idf, very low coord; NO 
result with ALL terms)
[octupus jacuzzi unicycle]
1, 0.90, 0.88, 0.88, 0.88

* Good query (some rarer terms, maybe some common, but several docs 
contain > 1 of the rarer terms)
[installing a jacuzzi in the home]
1, 0.90, 0.80, 0.78, 0.30, 0.20

* Too-common query (many common terms; results have high coord but low idf):
[home page of the web site]
1, 0.99, 0.99, 0.98, 0.93

Looking at these normalized scores, I suspect this "70% of top" rule 
doesn't work well in all cases. Maybe a better solution lies in mixing 
the "% of top" rule with raw-score thresholds somehow.
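
To make the comparison concrete, the sketch below counts how many results 
from each of the four example lists would survive a 70% cutoff. It assumes 
the scores are normalized so the top hit is 1.0, and it reads the "good 
query" list as fractions (0.90, 0.80, 0.78, 0.30, 0.20), which is my 
interpretation of the figures. The class name ThresholdDemo and the method 
countKept are hypothetical.

```java
// Hypothetical demo class: counts survivors of a "70% of top" cutoff
// for the four example score distributions in this message.
final class ThresholdDemo {

    // Counts scores that reach the given fraction of the top score.
    // Assumes the scores are already normalized so that the top score is 1.0.
    static int countKept(double[] normalized, double fraction) {
        int kept = 0;
        for (double s : normalized) {
            if (s >= fraction) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] names = {"Googlewhack", "Very rare", "Good", "Too-common"};
        double[][] scores = {
            {1, 0.30, 0.30, 0.25, 0.25},
            {1, 0.90, 0.88, 0.88, 0.88},
            {1, 0.90, 0.80, 0.78, 0.30, 0.20}, // read as fractions (assumption)
            {1, 0.99, 0.99, 0.98, 0.93},
        };
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": kept "
                + countKept(scores[i], 0.70) + " of " + scores[i].length);
        }
        // Prints:
        // Googlewhack: kept 1 of 5
        // Very rare: kept 5 of 5
        // Good: kept 4 of 6
        // Too-common: kept 5 of 5
    }
}
```

On these figures the cutoff trims the Googlewhack and good queries 
sensibly, but it keeps everything for both the very rare and the too-common 
queries, so the relative rule alone cannot tell those two cases apart.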
