lucene-java-user mailing list archives

From Paul Elschot
Subject Re: need some advice/help with negative query.
Date Fri, 06 Jan 2006 21:56:56 GMT
On Friday 06 January 2006 18:04, Beady Geraghty wrote:
> I would like to do queries that are negative. I mean a query with
> only negative terms and phrases.  For example, retrieve all
> documents that do not contain the term "apple".
> For now, I have a limited set of documents (say, 10000) to index.
> I can create a bitset that represents the search result of hits on "apple".
> Then I complement (XOR) the result.
> Each bit corresponds to a document ID.
> My question is:
> Inside Lucene, are the hits represented as some form of bitset?
> Can I get at it directly? I saw the BitSet class. (I now use
> Java's BitSet class.)
> Assuming that hits are internally represented as a bitset, for a
> small number of documents the bitset won't be very big,
> and if there are plenty of hits and many, many more documents,
> is the bitset still kept entirely
> in memory as well?

A Hits object does not hold a bitset. It is implemented by caching
some of the highest scoring documents; when more documents are
needed, the search is repeated to collect more documents.

The problem with negative queries is that the scores of the results
do not vary, so it is not useful to keep only the highest scoring docs.
This also means that all results will have to be processed further
in some other way.
The easiest way to do that is to use the MatchAllDocsQuery
as indicated earlier, and then use the low-level search API
with your own HitCollector.
You can then use any data structure in your HitCollector.
A simple and fast collect() implementation just counts the results, and
that can already be quite informative. Setting up a BitSet
for the matching document numbers is also possible.
It's best to avoid accessing the index via the IndexReader inside
the collect() implementation of the HitCollector, because collect()
is called once for every matching document in the inner search loop.
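
The two collectors mentioned above could look roughly like the sketch
below. To keep it runnable without Lucene on the classpath, the
DocCollector interface and the search() helper are hypothetical
stand-ins for Lucene's HitCollector and the low-level search call;
only java.util.BitSet is real here. The complement() helper shows the
"flip the bits" approach from the question, assuming document numbers
run from 0 to maxDoc - 1.

```java
import java.util.BitSet;

final class NegativeQuerySketch {

    /** Hypothetical stand-in for Lucene's HitCollector: called once per matching document. */
    interface DocCollector {
        void collect(int doc, float score);
    }

    /** Only counts the matches; nothing is stored per document. */
    static final class CountingCollector implements DocCollector {
        int count = 0;

        public void collect(int doc, float score) {
            count++;
        }
    }

    /** Records each matching document number as one bit. */
    static final class BitSetCollector implements DocCollector {
        final BitSet bits;

        BitSetCollector(int maxDoc) {
            bits = new BitSet(maxDoc);
        }

        public void collect(int doc, float score) {
            bits.set(doc); // no index access here, just a bit flip
        }
    }

    /** Hypothetical stand-in for a low-level search: reports every match to the collector. */
    static void search(int[] matchingDocs, DocCollector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc, 1.0f); // constant score, as with purely negative queries
        }
    }

    /** Returns the document numbers in [0, maxDoc) that are NOT set in hits. */
    static BitSet complement(BitSet hits, int maxDoc) {
        BitSet result = (BitSet) hits.clone();
        result.flip(0, maxDoc);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 8;              // assumed total number of documents
        int[] appleDocs = {1, 4, 6}; // assumed hits for the term "apple"

        BitSetCollector collector = new BitSetCollector(maxDoc);
        search(appleDocs, collector);

        BitSet withoutApple = complement(collector.bits, maxDoc);
        System.out.println(withoutApple); // prints {0, 2, 3, 5, 7}
    }
}
```

A BitSet costs about one bit per document, so for the 10000
documents in the question it stays well under two kilobytes.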

Paul Elschot
