lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Taylor <>
Subject Re: Anyway to not bother scoring less good matches ?
Date Thu, 05 May 2011 10:31:40 GMT
On 05/05/2011 11:13, Ahmet Arslan wrote:
>> Yes correct, but I have looked and the list of
>> optimizations before. What was clear from profiling was that
>> it wasnt the searching part that was slow (a query run on
>> the same index with only a few matching docs ran super fast)
>> the slowness only occurs when there are loads of matching
>> docs, and spends most of its time in scorer that is why I
>> was trying to remove the poor matches.
> Okey all clear. Can you give us some example query strings where there are loads of matching?
> Do you use stop word filter? Could it be case described as
> "As you approach the upper limits of a single machine,
> extremely frequent terms (called stop words) can become very
> expensive in the wrong query. If part of a top level BooleanQuery, a
> SHOULD clause that appears in every document will cause a match and
> score for every document in your index."
We used to use the default stop word list but have no stop words because 
Lucene is used to match very short fields  relating to Musicdata such as 
artist or album name, therefore the default stop words really need to be 
included to get good matches, for example how would you match the artist 
'The The' otherwise, so use of a stop word word list is not an option.

If people construct good queries thee is no problem, but the trouble is 
that many users just OR everything they are looking for because they 
don't want a good match rejected because just one term fails, but the 
problem is there are a number of very popular terms, for example the 
following query:

tnum:(6) qdur:(189) artist:(tama) track:(ibata) tracks:(10) release:(the 
global rhythm september 2002)

will match any song that is on an album with 10 tracks, any song which 
is trackno 6 on an album, and any release containg the word 'the' , when 
really what they are looking for is the song 'ibata' by artist 'tama',

This matches over a million documents (songs) , but doesn't match any 
well, because the song 'ibata' by 'tama' isnt actually in the index !

So I dont think the query is very good but I cannot force users to 
submit better queries, but I want to protect the server by reducing the 
time these kind of query take (upto 1 second as opposed to the more 
usual 100 milliseconds) and I hope that forcing x number of terms to 
match would do that.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message