lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader
Date Wed, 21 May 2008 08:37:55 GMT


Michael Busch updated LUCENE-1195:

    Attachment: lucene-1195.patch

Changes in the patch:
- the cache is now thread-safe
- added a ThreadLocal to TermInfosReader, so that each thread has its own cache of size 1024
- SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader
  puts TermInfo objects into the cache only if scanTo() has called next() more than once. Thus,
  if e.g. a WildcardQuery or RangeQuery iterates over terms in order, only the first term will
  be put into the cache. This is in addition to the ThreadLocal, and prevents a thread from
  wiping out its own cache with such a query.
- added a new package org/apache/lucene/util/cache that has a SimpleMapCache (taken from
  LUCENE-831) and the SimpleLRUCache that was part of the previous patch here. I decided to
  put the caches in a separate package, because we can reuse them for different things like
  LUCENE-831, or e.g. as an LRU cache for recently loaded stored documents after deprecating
  Hits.
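The per-thread cache described above can be sketched roughly as follows. This is a minimal illustration, assuming a LinkedHashMap-based eviction policy; the class name SimpleLRUCache and the capacity of 1024 come from the patch notes, but the implementation details here are an assumption, not the actual patch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-thread LRU cache along the lines the patch describes.
// The LinkedHashMap-based eviction is an illustrative assumption.
public class SimpleLRUCache<K, V> {
    private final Map<K, V> map;

    public SimpleLRUCache(final int capacity) {
        // accessOrder=true: iteration order is least-recently-accessed first
        this.map = new LinkedHashMap<K, V>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict the LRU entry when over capacity
            }
        };
    }

    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public int size() { return map.size(); }

    // Each thread gets its own cache of size 1024, so one thread's
    // term scans cannot evict another thread's hot entries.
    static final ThreadLocal<SimpleLRUCache<String, Object>> CACHE =
        ThreadLocal.withInitial(() -> new SimpleLRUCache<>(1024));

    public static void main(String[] args) {
        SimpleLRUCache<String, Integer> c = new SimpleLRUCache<>(2);
        c.put("a", 1);
        c.put("b", 2);
        c.get("a");    // touch "a", so "b" becomes least recently used
        c.put("c", 3); // evicts "b"
        System.out.println(c.get("a") + " " + c.get("b") + " " + c.get("c"));
    }
}
```

Wrapping the cache in a ThreadLocal avoids locking on the hot lookup path while still keeping the shared fallback cache thread-safe.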
I reran the same performance experiments, and the speedup is still the same; the overhead
of the ThreadLocal is in the noise. So I think this is a good approach.

I also ran similar performance tests on a bigger index with about 4.3 million documents.
The speedup with 50k AND queries was, as expected, not as significant anymore; however, it
was still about 7%. I haven't run the OR queries on the bigger index yet, but most likely
the speedup will not be very significant anymore.

All unit tests pass.

> Performance improvement for TermInfosReader
> -------------------------------------------
>                 Key: LUCENE-1195
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>         Attachments: lucene-1195.patch, lucene-1195.patch
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is done
> twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, which is why a significant performance improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add a small
> cache to TermInfosReader. 
> I ran some performance experiments with an LRU cache size of 20, and a mid-size index
> with 500,000 documents from Wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.
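The double lookup the issue describes, and how a small cache removes the second one, can be illustrated with the sketch below. CachedTermLookup, TermInfo, and the method names here are hypothetical stand-ins for illustration, not Lucene's actual internal API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the double dictionary lookup described in the issue:
// each term is looked up once for Similarity.idf() (via docFreq) and once
// when the posting list is opened. A small cache serves the second lookup.
// All names here are hypothetical stand-ins, not Lucene internals.
public class CachedTermLookup {
    static class TermInfo {
        final int docFreq;
        TermInfo(int docFreq) { this.docFreq = docFreq; }
    }

    private final Map<String, TermInfo> dictionary = new HashMap<>();
    private final Map<String, TermInfo> cache = new HashMap<>();
    int dictionaryLookups = 0; // counts the expensive lookups

    void addTerm(String term, int docFreq) {
        dictionary.put(term, new TermInfo(docFreq));
    }

    TermInfo lookup(String term) {
        TermInfo ti = cache.get(term);
        if (ti != null) return ti;  // repeat lookup served from the cache
        dictionaryLookups++;        // stands in for the expensive dictionary scan
        ti = dictionary.get(term);
        if (ti != null) cache.put(term, ti);
        return ti;
    }

    public static void main(String[] args) {
        CachedTermLookup r = new CachedTermLookup();
        r.addTerm("lucene", 42);
        int df = r.lookup("lucene").docFreq; // first lookup: the idf/docFreq path
        r.lookup("lucene");                  // second lookup: opening the posting list
        System.out.println(df + " " + r.dictionaryLookups);
    }
}
```

With the cache in place, the second lookup per term never touches the dictionary, which is where the reported 24-26% speedup on the 500,000-document index comes from.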

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
