lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject scorer optimizations
Date Wed, 21 Apr 2004 17:18:18 GMT wrote:
>   Bug in PhraseScorer and ConjunctionScorer
>   skipTo implementation fixed.


Thanks for finding fixing these.  This reminds me of a couple of 
scorer-related optimizations that we should keep in mind:

1. As you've noticed, the logic of ConjunctionScorer and PhraseScorer 
(and SpansNear, for that matter) are very similar, except 
ConjunctionScorer uses java's collection classes, while PhraseScorer 
uses Lucene's PriorityQueue and a home-grown linked list.  Because of 
this, PhraseScorer allocates no new objects while scoring, while 
ConjuctionScorer must allocate a number of objects each time skipTo() is 
called.  ConjunctionScorer is still usually much faster than 
BooleanScorer, since it uses SkipTo, but it could be much faster yet. 
So someday we should probably re-write ConjunctionScorer to use a 
PriorityQueue, so that it allocates fewer objects while scoring.

2. BooleanScorer does not return hits in strict document-id order 
(arguably a bug).  Because of this, ConjunctionScorer cannot be used 
when any of its sub-scorers use BooleanScorer, i.e., with sub-queries 
that contain optional or prohibited clauses.  This is unfortunate, as 
there are probably lots of cases where a conjunction of disjunctions 
could be accellerated by skipTo() that we are not accellerating.  One 
way to fix this is to re-write BooleanScorer so that it always returns 
hits in order, using an algorithm like that in SpanOrQuery.getSpans(). 
However, that would make large disjunctions somewhat slower, by a factor 
of log(n), where n is the number of clauses in the disjunction.  I don't 
know whether this would really be significant.  Another approach might 
be to write a version of BooleanScorer that returns hits in order, and 
only to use it when it is embedded in a conjunction.  This is trickier 
to implement, and it would be best if BooleanScorer always returned hits 
in the correct order, so I lean slightly towards the first approach.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message