lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: constant scoring queries
Date Thu, 12 May 2005 15:11:56 GMT
On Tuesday 10 May 2005 22:39, Doug Cutting wrote:
> Background: In, 
> Yonik Seely proposes a ConstantScoreQuery, based on a Filter.  And in 
> I proposed a mechanism to promote the use of Filters.  Through all of 
> this, Paul Elshot has hinted that there might be a better way.

Well, I did not think of combining constant scores with
evaluating clauses one by one, but it makes for a nice fit
in mixing pure boolean as constant scoring and normally
scored queries.

See below for some more comments.
> Here's another proposal, tackling many of the same issues:
> 1. Add two methods to
>    public boolean constantScoring();
>    public void constantScoring(boolean);
>    When constantScoring(), the boost() is the score for matches.

Since many types of queries can become constant scoring, and attribute
like this is better than a separate query class for constant scoring queries.

> 2. Add two methods to
>    public BitSet cachedBitSet(Query) { return null; }
>    public void cacheBitSet(Query, BitSet) {}
>    IndexSearcher overrides these to maintain an LRU cache of bitsets.

A cache of filters might end up having different implementations
of filters to save memory, see also below.
> 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses 
> is not thrown.

TooManyClauses in BooleanQuery works nicely, but
I prefer not to have a direct association between BooleanQuery and 
TooManyClauses. Instead I'd rather have a factory for the term scorers
that use buffer space (TermScorer and SpanTermScorer) and have
this factory throw TooManyClauses. The factory would be passed to
the rewrite() method.
This is the approach taken in the Surround query language, which
has two different factories, but they could be easily combined.
> 4. Modify BooleanScorer to, if constantScoring(),
>    - check Searcher for a cached bitset
>    - failing that, create a bitset
>    - evaluate clauses serially, saving results in bitset

That would only be possible for optional clauses, ie. the ones
passed to a DisjunctionScorer in the trunk. For this case, the
DisjunctionScorer would need to be extended. Btw. this would
be even faster than any available boolean scorer since there is
no need to compare doc id's in any way during scoring.

>    - cache the bitset

Or a compact version of it as in

>    - use the bitset to handle doc(), next() and skipTo();

and implement boolean AND, OR and NOT when combining filters.

> 5. TermQuery and PhraseQuery could be similarly modified, so that, when 
> constant scoring, bitsets are cached for very common terms (e.g., >5% of 
> documents).
> With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when 
> declared to be constant scoring, will operate much faster and never 
> throw TooManyClauses.  We can add an option (the default?) to 
> QueryParser to make these constant scoring.
> Also, instead of BitSet we could use an interface:
>    public interface DocIdSet {
>      void add(int docId);
>      boolean contains(int docId);
>      int next(int docId);
>    }
> to permit sparse representations.

The compact sparse representation referred to above stores
the differences between doc id's as VInt's, so it can only provide
an iterator over document numbers and not a random access as
in contains(int) above.

It needs only be used when it uses less memory than a bitset, which is the
case when it allows rougly less than 1/8 of the docs of the reader.
> Thoughts?

Yes, thanks for combining the constant score with the one by one approach :)

Paul Elschot.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message