lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject Re: Queries for many terms
Date Mon, 02 Nov 2015 17:50:25 GMT
How precise do you need to be?

I wonder if you could efficiently approximate "number of matches" by
getting the document frequency of each term. I realize this is an
approximation, but the highest document frequency would be your floor.

Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
lowest.

OK the following algorithm could refine your floor
- count = t1.docfreq
- Then issue a query for NOT t1, this eliminates many candidate documents
to improve performance
- Build a bloom filter or other set-membership data structure for t2...tn
https://en.wikipedia.org/wiki/Bloom_filter
- In a PostFilter(?) Lucene Collector(?) scan each collected/returned
document and do a set membership test against the bloom filter. If member,
then increment your count.

It's O(numDocs that don't match t1)

This is me just thinking outloud, but maybe it'll trigger thoughts in
others...
-Doug


On Mon, Nov 2, 2015 at 12:14 PM, Upayavira <uv@odoko.co.uk> wrote:

> I have a scenario where I want to search for documents that contain many
> terms (maybe 100s or 1000s), and then know the number of terms that
> matched. I'm happy to implement this as a query object/parser.
>
> I understand that Lucene isn't well suited to this scenario. Any
> suggestions as to how to make this more efficient? Does the TermsQuery
> work differently from the BooleanQuery regarding large numbers of terms?
>
> Upayavira
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message