lucene-java-user mailing list archives

From Beady Geraghty <beadygerag...@gmail.com>
Subject Re: need some advice/help with negative query.
Date Fri, 06 Jan 2006 22:22:07 GMT
Thanks.

On 1/6/06, Paul Elschot <paul.elschot@xs4all.nl> wrote:
>
> On Friday 06 January 2006 18:04, Beady Geraghty wrote:
> > I would like to run purely negative queries, that is, queries with
> > only negative terms and phrases. For example, retrieve all
> > documents that do not contain the term "apple".
> >
> > For now, I have a limited set of documents (say, 10000) to index.
> > I can create a bitset that represents the search result of hits on
> > "apple".
> > Then I complement the result (XOR with an all-ones bitset).
> > Each bit corresponds to a document ID.
> > My questions are:
> > Inside Lucene, are the hits represented in some form of bitset?
> > Can I get at it directly? I saw the BitSet class. (I now use
> > Java's BitSet class.)
> > Assuming that hits are internally represented as a bitset, for a
> > small number of documents the bitset won't be very big.
> > If there are plenty of hits and many more documents,
> > is the bitset still kept entirely in memory as well?
>
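The complement step described in the question can be sketched with java.util.BitSet alone. This is only an illustration under stated assumptions: the class name NegativeQuerySketch, the method complement, and the sample document numbers are made up here, and maxDoc is assumed to be the total number of document numbers in the index. One bit per document means 10000 documents need roughly 1.25 KB.

```java
import java.util.BitSet;

public class NegativeQuerySketch {
    /**
     * Returns the document numbers in [0, maxDoc) that are NOT set in hits.
     * The input bitset is left unchanged.
     */
    static BitSet complement(BitSet hits, int maxDoc) {
        BitSet result = (BitSet) hits.clone();
        result.flip(0, maxDoc); // flips bits 0 .. maxDoc-1
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 8;                 // hypothetical index size
        BitSet appleHits = new BitSet(maxDoc);
        appleHits.set(1);               // pretend documents 1 and 4 contain "apple"
        appleHits.set(4);

        BitSet noApple = complement(appleHits, maxDoc);
        System.out.println(noApple);    // prints {0, 2, 3, 5, 6, 7}
    }
}
```

Note that a plain complement also turns on the bits of any document numbers that belong to deleted documents, so those would need to be masked out separately.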
> A Hits object caches some of the highest-scoring documents;
> when more documents are needed, the search is repeated to
> collect more documents.
>
> The problem with negative queries is that the scores of the results
> do not vary, so it is not useful to keep only the highest-scoring docs.
> This also means that all results will have to be processed further
> in some other way.
> The easiest way to do that is to use the MatchAllDocsQuery
> as indicated earlier, and then use the low-level search API
> with your own HitCollector.
> You can then use any data structure in your HitCollector.
> A simple and fast collect() implementation just counts the results, and
> that can already be quite informative. Setting up a BitSet
> for the matching document numbers is also possible.
> It's best to avoid accessing the index via the IndexReader inside
> the collect() implementation of the HitCollector.
>
> Regards,
> Paul Elschot
>
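The two collector styles Paul mentions (counting only, and recording document numbers in a BitSet) might look roughly like the sketch below. To keep it runnable without Lucene on the classpath, DocCollector is a hypothetical stand-in whose collect(int doc, float score) method mirrors the shape of HitCollector's callback, and matchAllExcept is a made-up driver that plays the role of the search; neither is part of Lucene.

```java
import java.util.BitSet;

public class CollectorSketch {
    /** Hypothetical stand-in for the HitCollector callback; not the real Lucene class. */
    interface DocCollector {
        void collect(int doc, float score);
    }

    /** Counts matches only; keeps no per-document state. */
    static final class CountingCollector implements DocCollector {
        int count = 0;

        public void collect(int doc, float score) {
            count++;
        }
    }

    /** Records each matching document number as one bit. */
    static final class BitSetCollector implements DocCollector {
        final BitSet bits;

        BitSetCollector(int maxDoc) {
            bits = new BitSet(maxDoc);
        }

        public void collect(int doc, float score) {
            bits.set(doc); // no index access here, only an in-memory update
        }
    }

    /** Made-up driver: reports every document number in [0, maxDoc) not in excluded. */
    static void matchAllExcept(BitSet excluded, int maxDoc, DocCollector collector) {
        for (int doc = excluded.nextClearBit(0); doc < maxDoc;
                 doc = excluded.nextClearBit(doc + 1)) {
            collector.collect(doc, 1.0f); // constant score, as with negative queries
        }
    }

    public static void main(String[] args) {
        BitSet appleDocs = new BitSet();
        appleDocs.set(1);
        appleDocs.set(4);

        CountingCollector counter = new CountingCollector();
        matchAllExcept(appleDocs, 8, counter);
        System.out.println(counter.count);   // prints 6

        BitSetCollector collected = new BitSetCollector(8);
        matchAllExcept(appleDocs, 8, collected);
        System.out.println(collected.bits);  // prints {0, 2, 3, 5, 6, 7}
    }
}
```

With Lucene itself, the driver would presumably be replaced by a search over a BooleanQuery that combines MatchAllDocsQuery with a prohibited clause for "apple", passed together with a HitCollector subclass to the searcher. Both collectors do only cheap in-memory work per document, in line with the advice to keep IndexReader access out of collect().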
