lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject Re: lucene parser, negative OR operands
Date Tue, 17 May 2011 23:22:36 GMT
On 5/17/2011 7:07 PM, Markus Jelsma wrote:
>
> This is probably due to the contents (HTML bodies) of the documents I've
> queried. It's not so strange for this type of document to return fewer
> documents when two negated operands are specified. In my case (I tested it) a
> conjunction returned the same documents as a disjunction did.
>
> Again, I haven't done extensive testing on this subject.

I think we're disagreeing about what the proper behavior is. As I
understand boolean logic, the contents of your documents don't matter:
getting fewer documents back from the OR is logically impossible.
Perhaps I am wrong to expect that lucene's pseudo-boolean operators will
behave like actual boolean logic?

Under boolean logic -- assuming "-one" means the same thing as "NOT one",
i.e. both mean "all documents that do not have 'one'" -- this identity
holds, right?

-one OR -two === (NOT one) OR (NOT two) ===  NOT (  one AND two )

And it is logically impossible for that query to return FEWER results
than "-one" alone does, or than "-two" alone does, in ANY corpus.  It
can return the same number, or it can return more.  You can never get
fewer documents by adding an "OR" union on, right? That's a set union:
the union of set A with some other (possibly empty) set B can never have
fewer members than set A alone!
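
To make the set argument concrete, here is a minimal Python sketch. The
four-document corpus and the helper name docs_without are made up purely
for illustration; this models plain boolean logic over sets, not what any
Solr or lucene query parser actually does.

```python
# Hypothetical toy corpus: document id -> set of terms it contains.
corpus = {
    1: {"one", "two"},
    2: {"one"},
    3: {"two"},
    4: {"three"},
}

def docs_without(term):
    """Ids of documents that do NOT contain `term` (boolean "NOT term")."""
    return {doc_id for doc_id, terms in corpus.items() if term not in terms}

not_one = docs_without("one")
not_two = docs_without("two")

# (NOT one) OR (NOT two), modelled as a set union.
union = not_one | not_two

# NOT (one AND two), computed directly from the corpus.
not_both = {doc_id for doc_id, terms in corpus.items()
            if not ("one" in terms and "two" in terms)}

# De Morgan: the two formulations select exactly the same documents.
assert union == not_both

# A union can never be smaller than either of its operands.
assert len(union) >= len(not_one)
assert len(union) >= len(not_two)
```

Whatever corpus you substitute, the last two assertions cannot fail,
because every member of either operand is also a member of the union.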

In fact, playing around more and comparing hit counts, it looks like 
Solr 1.4.1 lucene query parser treats:

"-one OR -two"
the same as
NOT (one OR two)

Which is not/should not be the same query at all.

The first is "all documents that don't have 'one' COMBINED WITH all 
documents that don't have 'two'".  The second is "all documents that 
have NEITHER 'one' NOR 'two'".   Those are two different things, or 
ought to be.
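
A small Python sketch of the difference between those two readings, again
on a made-up four-document corpus (the corpus and variable names are
assumptions for illustration only, and this is ordinary set logic, not
the parser's implementation):

```python
# Hypothetical toy corpus: document id -> set of terms it contains.
corpus = {
    1: {"one", "two"},
    2: {"one"},
    3: {"two"},
    4: {"three"},
}

all_docs = set(corpus)
has_one = {doc_id for doc_id, terms in corpus.items() if "one" in terms}
has_two = {doc_id for doc_id, terms in corpus.items() if "two" in terms}

# Reading 1: "-one OR -two" as (NOT one) OR (NOT two).
either_missing = (all_docs - has_one) | (all_docs - has_two)

# Reading 2: NOT (one OR two), i.e. documents with NEITHER term.
neither = all_docs - (has_one | has_two)

# The second set is always contained in the first, but in this corpus
# it is strictly smaller, so the two queries are not interchangeable.
assert neither <= either_missing
assert neither != either_missing
```

Here the first reading matches every document except the one containing
both terms, while the second matches only the document containing neither.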

Or am I wrong to think that? That is certainly the way boolean algebra
works: if "-one" is a boolean negation, the same as "NOT one", then "-one
OR -two" definitely ought _not_ to be the same query as "NOT (one OR
two)".   But maybe I should not be expecting predictable boolean algebra
here? If that's the case, then I'm not sure what behavior I should be
expecting, or what the predictable behavior of these operators is!

If we want to make things even more confusing, I can supply some other 
patterns involving an explicit "NOT" that also don't work how I expect 
or according to any predictable way I can figure out.
