lucene-dev mailing list archives

From Doug Cutting
Subject Re: Optimizing SegmentTermEnum (and friends)
Date Tue, 25 Feb 2003 22:13:50 GMT
Dmitry Serebrennikov wrote:
> 1) Since I do not need the intermediate terms, it makes sense to try to 
> have a method that skips to the right term without creating the 
> intermediate Term objects. I did a version of this yesterday and 
> ended up seeing a factor of 2 performance increase and a factor of 2 
> garbage reduction. The patch adds the following method to
> final int compareTo(String otherField, char[] otherText, int start, int len)
> And changes to delay creation of Term object until 
> call to term().
> Full diff is attached. Any comments are welcome, especially if I've 
> missed something.

Looks reasonable to me.  Does it still pass all of the unit tests?
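
As an illustration of the idea in point 1, here is a minimal sketch of an
allocation-free comparison. It assumes terms are ordered by field name first
and then by text in char order, as String.compareTo does. The TermBuffer class
is hypothetical and is not taken from the attached patch; only the compareTo
signature comes from the message above.

```java
/** Hypothetical holder for an enum's current term; not Lucene's actual class. */
final class TermBuffer {
    private final String field;
    private final char[] text;
    private final int length;

    TermBuffer(String field, String text) {
        this.field = field;
        this.text = text.toCharArray();
        this.length = this.text.length;
    }

    /**
     * Compares the buffered term with (otherField, otherText[start .. start+len))
     * without allocating a Term or a String: field first, then text char by char.
     */
    int compareTo(String otherField, char[] otherText, int start, int len) {
        int cmp = field.compareTo(otherField);
        if (cmp != 0) return cmp;
        int n = Math.min(length, len);
        for (int i = 0; i < n; i++) {
            int diff = text[i] - otherText[start + i];
            if (diff != 0) return diff;
        }
        // All shared chars are equal, so the shorter text sorts first.
        return length - len;
    }
}
```

A caller scanning toward a target term could keep the target's chars in a
reusable buffer and build a Term object only once the scan stops.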

> 3) I found a piece of code in that uses a field 
> SegmentTermEnum.prev to try to optimize seeks. It looks like this code 
> was put in after the original SegmentTermEnum was finished. I can't find 
> any record of this change in Jakarta's CVS, so probably it was done 
> prior to moving to Jakarta. Does anyone remember why this is here? Does 
> it actually serve a useful purpose? It seems that the condition this 
> code is testing for would not really occur. Perhaps I'm missing 
> something. Here's the code fragment that uses the .prev field:
>  /** Returns the TermInfo for a Term in the set, or null. */
>  final synchronized TermInfo get(Term term) throws IOException {
>    if (size == 0) return null;
>      // optimize sequential access: first try scanning cached enum w/o seeking
>    if (enum.term() != null              // term is at or past current
>        && ((enum.prev != null && term.compareTo(enum.prev) > 0)
>            || term.compareTo(enum.term()) >= 0)) {
>        int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
>        if (indexTerms.length == enumOffset      // but before end of block
>            || term.compareTo(indexTerms[enumOffset]) < 0)
>                return scanEnum(term);              // no need to seek
>    }
>      // random-access: must seek
>    seekEnum(getIndexOffset(term));
>    return scanEnum(term);
>  }

If you put a print statement in this and run the unit tests, you'll see 
that this optimization fires a lot.  If, e.g., one expands a wildcarded 
string into a bunch of terms, which are near one another in the enum, 
then subsequently asks for the frequency of each term (to weight it in a 
query), and then, in a third pass, asks for its TermDocs, then each of 
these latter passes benefits from this optimization.  So let's not lose it.
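
The effect can be seen in a toy model. The sketch below is not Lucene code:
ToyTermReader, its in-memory arrays, the seeks counter and the small
INDEX_INTERVAL are all assumptions made for the demonstration, and the
enum.prev check from the quoted fragment is left out for brevity. It keeps a
cursor over a sorted term list and seeks through the sparse index only when
the requested term lies before the cursor or beyond the current index block.

```java
/** Toy sorted term dictionary with a sparse index; names are illustrative only. */
final class ToyTermReader {
    static final int INDEX_INTERVAL = 4;  // assumed small value for the demo

    private final String[] terms;         // sorted; stands in for the term enum
    private final String[] indexTerms;    // every INDEX_INTERVAL-th term
    private int position = -1;            // cursor of the cached enum, -1 = unset
    int seeks = 0;                        // number of random-access seeks so far

    ToyTermReader(String[] sortedTerms) {
        terms = sortedTerms;
        int n = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        indexTerms = new String[n];
        for (int i = 0; i < n; i++) indexTerms[i] = sortedTerms[i * INDEX_INTERVAL];
    }

    /** Returns the position of term in the sorted list, or -1 if absent. */
    int get(String term) {
        if (terms.length == 0) return -1;
        // Sequential access: term is at or past the cursor...
        if (position >= 0 && position < terms.length
                && term.compareTo(terms[position]) >= 0) {
            int enumOffset = position / INDEX_INTERVAL + 1;
            // ...and before the end of the current index block.
            if (enumOffset == indexTerms.length
                    || term.compareTo(indexTerms[enumOffset]) < 0) {
                return scan(term);        // no need to seek
            }
        }
        seek(indexOffset(term));          // random access: must seek
        return scan(term);
    }

    /** Binary search for the last index term that is <= term (or 0). */
    private int indexOffset(String term) {
        int lo = 0, hi = indexTerms.length - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int c = term.compareTo(indexTerms[mid]);
            if (c < 0) hi = mid - 1;
            else if (c > 0) lo = mid + 1;
            else return mid;
        }
        return Math.max(hi, 0);
    }

    private void seek(int indexOffset) {
        seeks++;
        position = indexOffset * INDEX_INTERVAL;
    }

    /** Advances the cursor until it reaches a term >= the requested one. */
    private int scan(String term) {
        while (position < terms.length && term.compareTo(terms[position]) > 0) {
            position++;
        }
        return (position < terms.length && terms[position].equals(term)) ? position : -1;
    }
}
```

Looking up neighbouring terms in ascending order, as in the second and third
passes described above, costs one seek for the first term and none for the
rest of the block.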

