lucene-java-user mailing list archives

From mark harwood <>
Subject Re: Reducing number of poor results from large BooleanQueries
Date Fri, 09 Sep 2005 09:27:54 GMT
Hi Chris,
Here is an approach which works based on the quantity
of matching terms in an adapted BooleanQuery:

Paul makes an interesting observation at the end which
shows how this functionality can be added to the
existing BooleanQuery without too much effort. I'd
personally like to see this added to BooleanQuery. As
an example application, I currently use this
functionality in my custom
CoordConstrainedBooleanQuery to prevent "More Like
This" queries returning long lists of dissimilar
documents by insisting that at least 30% of generated query terms match.
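
To make the term-count idea concrete, here is a rough sketch in plain Java.
It is not the CoordConstrainedBooleanQuery code and does not use the Lucene
API; the class MinMatchFilter and its methods are hypothetical names made up
for illustration. It assumes the query terms are distinct and that a
document can be represented as a set of its terms.

```java
import java.util.Collection;
import java.util.Set;

/** Hypothetical helper (not part of Lucene): accepts a document only if enough query terms match. */
final class MinMatchFilter {
    private MinMatchFilter() {}

    /** Number of query terms that must match to reach minPercent, rounded up. */
    static int requiredMatches(int queryTermCount, int minPercent) {
        // Integer arithmetic avoids floating-point rounding surprises.
        return (queryTermCount * minPercent + 99) / 100;
    }

    /** True if at least minPercent of the (assumed distinct) query terms occur in the document. */
    static boolean accepts(Collection<String> queryTerms, Set<String> docTerms, int minPercent) {
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return matched >= requiredMatches(queryTerms.size(), minPercent);
    }
}
```

With 10 generated query terms and a 30% threshold, a document needs 3
matching terms to be kept. In a real scorer the same check would be made
against the count of matching clauses rather than against a term set.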

This approach of course is based purely on the
quantity of matching terms, not the quality-based
measures in your example. As you suggest, quality is a
combination of user-derived measures (boosts) and
data-derived measures (tf, idf, docBoost). It sounds
like a more informed approach in principle but I'm
not currently sure how it would be implemented
efficiently in practice. Here's one possible approach
I can think of:
I have previously optimized large BooleanQueries
generated from nGrams by taking only the top
idf-ranked terms, purely to reduce query times. A
similar approach could be used to automatically
rewrite a BooleanQuery consisting of entirely optional
terms into the equivalent of:
+(my high idf terms) (low idf terms)
Basically this produces a query that MUST match the
decent terms and scores extra points for the "optional
extras". Query term boosts could be factored into the
decision for selecting the "Must have" terms and "nice
to haves".
This would help maintain a minimum level of relevance
when relevance isn't the primary sort field.
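
The rewrite described above could be sketched roughly as follows. This is
plain Java producing a query string, not code against the Lucene API; the
class IdfQueryRewriter, its parameters, and the fixed mustCount cut-off are
assumptions made for illustration, and the idf formula (ln(numDocs /
(docFreq + 1)) + 1) is one common variant chosen for the example. Query
term boosts are not factored in here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): splits optional terms into "+(high idf) low idf". */
final class IdfQueryRewriter {
    private IdfQueryRewriter() {}

    /** Assumed idf variant: rarer terms (lower docFreq) get higher values. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Rewrites all-optional terms so the mustCount highest-idf terms form a
     * required group and the remaining terms stay optional.
     */
    static String rewrite(Map<String, Integer> docFreqs, int numDocs, int mustCount) {
        List<String> terms = new ArrayList<>(docFreqs.keySet());
        // Sort by descending idf; break ties alphabetically for stable output.
        terms.sort((a, b) -> {
            int byIdf = Double.compare(idf(docFreqs.get(b), numDocs), idf(docFreqs.get(a), numDocs));
            return byIdf != 0 ? byIdf : a.compareTo(b);
        });
        int cut = Math.min(Math.max(mustCount, 0), terms.size());
        String must = String.join(" ", terms.subList(0, cut));
        String optional = String.join(" ", terms.subList(cut, terms.size()));
        if (must.isEmpty()) {
            return optional;            // nothing promoted: query stays fully optional
        }
        if (optional.isEmpty()) {
            return "+(" + must + ")";   // every term ended up in the required group
        }
        return "+(" + must + ") " + optional;
    }
}
```

For example, with document frequencies rare=1, uncommon=9, common=500 and
the=900 in a 1000-document index and mustCount=2, rewrite returns
"+(rare uncommon) common the". A document must then match at least one of
the rare terms, and the common terms only add to the score.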

