lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christiaan Fluit <>
Subject determination of matching hits
Date Mon, 20 Dec 2004 15:32:05 GMT
Hello all,

I have a question regarding the determination of the set of matching 
documents, in particular (I guess) related to the NOT operator.

In my case I have a document containing the terms A and B. When I query 
for either A or for B, I get this document back, just as expected. Now 
when I query for A -B, I once again get this document back. In other 
words: this document matches both B and a query containing the clause 
-B, which theoretically should never happen.

I've seen this happen with various keywords, sometimes with multiple 
"conflicting" documents. In each case, the B query returned the document 
with a very low relevance (e.g. 0.007...).

Based on these low relevancies and a quick peek in the Lucene code, I 
strongly suspect that this is caused by rounding errors, as it seems to 
me that floating point numbers are used to both express the membership 
of a set as well as its score. Can somebody confirm this?

And if this is the case, is there a workaround to eliminate or at least 
significantly suppress this problem? A colleague mentioned boosting 
every term in a query, would this solve anything?

For most search engine-like applications, which order documents on 
relevance, I think this problem is not a real issue since such 
conflicting documents appear at the end of the result list and are not 
likely to be seen by the user. However, in our case we have an 
application which displays overlaps of entire result sets and these 
documents show up very prominently (I can show screenshots if desired). 
We have already been asked by customers to explain these results :)

FYI, in case it may be relevant: I'm still using Lucene 1.4.2. Every 
document has the same set of five fields. The above queries are parsed 
by MultiFieldQueryParser, using all five fields. I haven't touched the 
default operator, but the queries A AND -B and A AND NOT B give the same 
conflicting overlap in the result set.

Thanks in advance,

Christiaan Fluit

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message