lucene-java-user mailing list archives

From Doug Cutting
Subject Re: Query Filters on term A in query "A AND (B OR C OR D)"
Date Thu, 13 Nov 2003 20:51:31 GMT
Dan Quaroni wrote:
> name:Bob's Discount Furniture AND state:California AND city:San Diego
> Now, that query is going to retrieve EVERY Bob's discount furniture, EVERY
> company in California, and EVERY city in San Diego and then join them.  That
> makes the memory requirements for this query far higher than they really
> need to be.

First, it doesn't have to retrieve "every company in California", only 
an integer (the internal document id) representing that company.  Even 
then, these integers are compressed so that each requires about a byte 
of i/o.  So if you have 10,000 companies in California, then, yes, about 
10k bytes must be processed, but that's actually not much work: far less 
than, e.g., processing 10,000 data structures, each representing a 
company.  Lucene's data structures are highly optimized for this.  That's 
why Lucene is faster than an RDBMS for full-text searching.
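
To make the "about a byte per posting" figure concrete, here is a rough 
sketch of the general idea: store the gap between consecutive document 
ids rather than the ids themselves, and write each gap as a 
variable-length integer.  The class and method names are made up for 
illustration and are not Lucene's API; the real index format stores more 
than this.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a hypothetical codec, not one of Lucene's classes.
final class DocIdGapCodec {

    // Encodes ascending, non-negative doc ids as gaps. Each gap is written
    // 7 bits at a time, low bits first; the high bit of a byte means
    // "more bytes follow".
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;
            previous = docId;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Reverses encode(): reads 'count' gaps and rebuilds the doc ids.
    static int[] decode(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            docIds[i] = previous;
        }
        return docIds;
    }
}
```

With this scheme any gap below 128 takes a single byte, so 10,000 
postings whose ids are never more than 127 apart encode to exactly 
10,000 bytes.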

Second, while these must all be processed, they're not all in memory at 
once.  Posting lists are merged, so memory requirements are proportional 
to the number of terms in the query (one buffer per term).  So the issue 
is not memory, but compute time.
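
A minimal sketch of such a merge for an AND query, assuming each term's 
postings arrive as a strictly ascending stream of document ids.  The 
names are hypothetical and this is not Lucene's scorer code; the point is 
only that the per-term state is a single cursor, however long the lists 
are.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of Lucene.
final class PostingMerge {

    // Intersects several ascending doc-id streams. Besides the result,
    // the only state held is one "current doc id" per term.
    static List<Integer> intersect(List<Iterator<Integer>> postings) {
        List<Integer> hits = new ArrayList<>();
        int n = postings.size();
        if (n == 0) {
            return hits;
        }
        int[] current = new int[n];
        for (int i = 0; i < n; i++) {
            if (!postings.get(i).hasNext()) {
                return hits;            // an empty list means no matches
            }
            current[i] = postings.get(i).next();
        }
        while (true) {
            // The largest current id is the next candidate match.
            int max = current[0];
            for (int i = 1; i < n; i++) {
                max = Math.max(max, current[i]);
            }
            boolean allEqual = true;
            for (int i = 0; i < n; i++) {
                // Skip every stream forward to the candidate.
                while (current[i] < max) {
                    if (!postings.get(i).hasNext()) {
                        return hits;    // one stream is exhausted: done
                    }
                    current[i] = postings.get(i).next();
                }
                if (current[i] != max) {
                    allEqual = false;   // overshot; retry with a new max
                }
            }
            if (allEqual) {
                hits.add(max);
                if (!postings.get(0).hasNext()) {
                    return hits;
                }
                current[0] = postings.get(0).next();
            }
        }
    }
}
```

Every posting is still read once, which is the compute cost mentioned 
above, but nothing is buffered beyond the cursors and the hits.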

So one could potentially optimize some cases: if one term matches many 
documents but another matches only a few, then in theory you could just 
enumerate the postings of the lower-frequency term, fetch the document 
record for each, and check it for the higher-frequency term.  This is 
what an RDBMS does to optimize queries.  In practice, with Lucene, the 
difference in frequency would have to be *very* large in order for this 
to be faster.
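
For comparison, the RDBMS-style alternative might look roughly like the 
sketch below.  Everything here is a stand-in: `fetchDocTerms` is a 
hypothetical lookup that returns the terms of one stored document, and 
Lucene does not expose the operation in this form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Illustrative only: a hypothetical helper, not part of Lucene.
final class RareTermFirst {

    // Walks only the rare term's postings and checks each fetched
    // document for the common term.
    static List<Integer> filter(int[] rarePostings,
                                IntFunction<Set<String>> fetchDocTerms,
                                String commonTerm) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : rarePostings) {
            // One document fetch per rare posting; this is the expensive step.
            Set<String> terms = fetchDocTerms.apply(docId);
            if (terms != null && terms.contains(commonTerm)) {
                hits.add(docId);
            }
        }
        return hits;
    }
}
```

The common term's posting list is never read, but each rare posting now 
costs a document fetch instead of roughly a byte of sequential i/o, so 
the rare list has to be much shorter than the common one before this 
pays off.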

