lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <>
Subject [jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
Date Tue, 13 May 2008 22:17:55 GMT


Mark Harwood commented on LUCENE-1187:

Good work.
Just tried the patch and ran some pre and post-patch benchmarks.

I wanted to measure the overhead of:
   the new OpenBitSetDISI.inPlaceOr(DocIdSetIterator)
versus
   the previous scheme of BitSet.or(BitSet).

My test was on the biggest index I have here, which is 3 million Wikipedia docs. I had 2 cached
TermsFilters on very popular terms (500k docs in each) and measured the cost of combining
these as 2 "shoulds" in a BooleanFilter.
The expectation was that the new scheme would add some overhead in extra method calls.

The average cost of iterating across BooleanFilter.getDocIdSet() was:

old BitSet scheme: 78 milliseconds
new DISI scheme: 156 milliseconds.
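
The gap between the two schemes can be illustrated with a minimal sketch. This is an assumption-laden stand-in, not the Lucene code: it uses java.util.BitSet in place of OpenBitSet and OpenBitSetDISI, and the class and method names (OrStrategies, orByWords, orByIteration) are hypothetical. One method unions whole 64-bit words at a time, the other visits each set bit individually, which is roughly the work an iterator-driven OR has to do.

```java
import java.util.BitSet;

public class OrStrategies {
    // Word-at-a-time union: BitSet.or combines the backing long words,
    // so up to 64 doc ids are handled per operation.
    static BitSet orByWords(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    // Per-document union: visits every set bit of b and sets it in the
    // result, one call per doc id (a stand-in for an iterator-driven OR).
    static BitSet orByIteration(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        for (int doc = b.nextSetBit(0); doc >= 0; doc = b.nextSetBit(doc + 1)) {
            result.set(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 3_000_000;           // index size used in the test above
        BitSet a = new BitSet(maxDoc);
        BitSet b = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc += 6) a.set(doc);  // 500k docs
        for (int doc = 3; doc < maxDoc; doc += 6) b.set(doc);  // 500k docs

        BitSet byWords = orByWords(a, b);
        BitSet byIteration = orByIteration(a, b);
        System.out.println("same result: " + byWords.equals(byIteration));
        System.out.println("docs in union: " + byWords.cardinality());
    }
}
```

Both methods produce the same set; only the number of operations differs, so any timing comparison built on this sketch says nothing definitive about the Lucene classes themselves.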

To address this I tried adding this optimisation into BooleanFilter...

        DocIdSet dis = ((Filter) shouldFilters.get(i)).getDocIdSet(reader);
        if (dis instanceof OpenBitSet)
            res.or((OpenBitSet) dis); // go-faster method
        else
            res.inPlaceOr(getDISI(shouldFilters, i, reader)); // your patch code

Before I could benchmark this I had to amend TermsFilter to use OpenBitSet rather than plain
old BitSet.

avg speed of your patch with OpenBitSet-enabled TermsFilter: 100 milliseconds
avg speed of your patch with OpenBitSet-enabled TermsFilter and above optimisation: 70 milliseconds
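
The fast-path idea can be sketched in self-contained form as follows. This is only an illustration under stated assumptions: DocSet, BitDocSet, SortedDocSet and FastPathUnion are hypothetical stand-ins invented for the example (they are not Lucene types), and java.util.BitSet takes the place of OpenBitSet. The union checks each input's concrete type and uses the bulk bitset OR where it can, falling back to per-document iteration otherwise.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.PrimitiveIterator;

public class FastPathUnion {
    /** Hypothetical stand-in for a doc id set that can only be iterated. */
    interface DocSet {
        PrimitiveIterator.OfInt iterator();
    }

    /** Hypothetical bitset-backed doc id set (plays the role of OpenBitSet). */
    static final class BitDocSet implements DocSet {
        final BitSet bits;
        BitDocSet(BitSet bits) { this.bits = bits; }
        public PrimitiveIterator.OfInt iterator() { return bits.stream().iterator(); }
    }

    /** Hypothetical doc id set backed by a sorted int array (no bitset inside). */
    static final class SortedDocSet implements DocSet {
        final int[] docs;
        SortedDocSet(int... docs) { this.docs = docs; }
        public PrimitiveIterator.OfInt iterator() { return Arrays.stream(docs).iterator(); }
    }

    /** Unions all sets, using the bulk OR whenever a set is bitset-backed. */
    static BitSet union(List<DocSet> sets) {
        BitSet res = new BitSet();
        for (DocSet set : sets) {
            if (set instanceof BitDocSet) {
                res.or(((BitDocSet) set).bits);      // fast path: word-wise OR
            } else {
                PrimitiveIterator.OfInt it = set.iterator();
                while (it.hasNext()) {
                    res.set(it.nextInt());           // generic path: one doc at a time
                }
            }
        }
        return res;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(1);
        bits.set(5);
        List<DocSet> sets = Arrays.asList(new BitDocSet(bits), new SortedDocSet(2, 5, 9));
        System.out.println(union(sets));
    }
}
```

The type check costs one instanceof per input set, so it is cheap relative to the union itself; the generic branch keeps the method correct for any set that is not bitset-backed.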

I'll try and post a proper patch when I get more time to look at this...


> Things to be done now that Filter is independent from BitSet
> ------------------------------------------------------------
>                 Key: LUCENE-1187
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*, Search
>            Reporter: Paul Elschot
>            Assignee: Michael Busch
>            Priority: Minor
>         Attachments: BooleanFilter20080325.patch, ChainedFilterAndCachingFilterTest.patch,
Contrib20080325.patch, Contrib20080326.patch, Contrib20080427.patch, javadocsZero2Match.patch,
> (Aside: where is the documentation on how to mark up text in jira comments?)
> The following things are left over after LUCENE-584 :
> For Lucene 3.0  Filter.bits() will have to be removed.
> There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the boolean
> behaviour of a Filter.
> I have not looked into Filter caching yet, but I suppose there will be some room for
> improvement there.
> Iirc the current core has moved to use OpenBitSetFilter and that is probably what is
> being cached.
> In some cases it might be better to cache a SortedVIntList instead.
> Boolean logic on DocIdSetIterator is already available for Scorers (that inherit from
> DocIdSetIterator) in the search package. This is currently implemented by ConjunctionScorer,
> ReqOptSumScorer and ReqExclScorer.
> Boolean logic on BitSets is available in contrib/misc and contrib/queries
> DisjunctionSumScorer calls score() on its subscorers before the score value actually
> This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as a superclass
> of DisjunctionSumScorer.
> To fully implement non scoring queries a TermDocIdSetIterator will be needed, perhaps
> as a superclass of TermScorer.
> The javadocs in using matching vs non-zero score:
> I'll investigate this soon, and provide a patch when necessary.
> An early version of the patches of LUCENE-584 contained a class Matcher,
> that differs from the current DocIdSet in that Matcher has an explain() method.
> It remains to be seen whether such a Matcher could be useful between
> DocIdSet and Scorer.
> The semantics of scorer.skipTo(scorer.doc()) was discussed briefly.
> This was also discussed at another issue recently, so perhaps it is worthwhile to open
> a separate issue for this.
> Skipping on a SortedVIntList is done using linear search; this could be improved by adding
> multilevel skiplist info much like in the Lucene index for documents containing a term.
> One comment by me of 3 Dec 2008:
> A few complete (test) classes are deprecated, it might be good to add the target release
> for removal there.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
