lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader
Date Fri, 23 May 2008 02:19:55 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

The previous patch had a silly thread-safety problem, which I have
now fixed. Some threads in the TestIndexReaderReopen test occasionally
hit errors (I changed the test case so that it now fails whenever an
error is hit).

I made some other changes to the TermInfosReader. I'm not using
two ThreadLocals anymore for the SegmentTermEnum and Cache,
but added a small inner class called ThreadResources which holds
references to those two objects. I also minimized the number of
ThreadLocal.get() calls by passing around the enumerator.

Furthermore, I got rid of the private scanEnum() method and inlined
it into the get() method to fix the above-mentioned thread-safety
problem. I also realized that the cache itself does not have to
be thread-safe, because we put it into a ThreadLocal.
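
A minimal sketch of the ThreadResources idea described above, not the
actual patch. Only the name ThreadResources comes from this message;
TermLookupSketch, lookup(), the String/Integer stand-ins for the
enumerator and the TermInfo, and the lookup counter are assumptions
made for illustration. It shows one ThreadLocal holding both per-thread
objects, a single ThreadLocal.get() per call with the result passed
down, and an unsynchronized per-thread cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for TermInfosReader (hypothetical class name).
class TermLookupSketch {

    // One holder per thread, so a single ThreadLocal replaces two.
    private static final class ThreadResources {
        // Stand-in for the per-thread SegmentTermEnum (here: last term scanned).
        String lastScannedTerm;
        // Plain HashMap: no synchronization, only the owning thread sees it.
        final Map<String, Integer> cache = new HashMap<>();
    }

    private final ThreadLocal<ThreadResources> threadResources =
            ThreadLocal.withInitial(ThreadResources::new);

    // Stand-in for the term dictionary (term -> docFreq); read-only here.
    private final Map<String, Integer> dictionary;

    // Counts slow dictionary lookups across all threads (for illustration).
    private final AtomicInteger dictionaryLookups = new AtomicInteger();

    TermLookupSketch(Map<String, Integer> dictionary) {
        this.dictionary = new HashMap<>(dictionary);
    }

    int docFreq(String term) {
        // Fetch the per-thread state once and pass it down.
        ThreadResources resources = threadResources.get();
        return lookup(term, resources);
    }

    private int lookup(String term, ThreadResources resources) {
        Integer cached = resources.cache.get(term);
        if (cached != null) {
            return cached; // cache hit: no dictionary lookup
        }
        dictionaryLookups.incrementAndGet();
        resources.lastScannedTerm = term;
        int freq = dictionary.getOrDefault(term, 0);
        resources.cache.put(term, freq);
        return freq;
    }

    int dictionaryLookupCount() {
        return dictionaryLookups.get();
    }
}
```

Because each thread gets its own ThreadResources instance, a second
thread looking up the same term would miss its own (empty) cache once
and then hit it on later calls.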

I reran the same performance test that I ran for the first patch and
this version seems to be even faster: 107 secs vs. 112 secs with
the first patch (~30% improvement compared to trunk, 152 secs).

All tests pass, including the improved
TestIndexReaderReopen.testThreadSafety(), which I ran multiple
times.

OK, I think this patch is ready now. I'm planning to commit it in a
day or so.

> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
> twice for each term. The first time is in Similarity.idf(), where searcher.docFreq() is called.
> The second time is when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, which is why a significant performance improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
> cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a mid-size index of
> 500,000 documents from Wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.
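
The caching idea in the issue description can be sketched roughly as
follows. This is an illustration under assumptions, not the code from
the patch: LruCacheSketch and CachedDictionarySketch are hypothetical
names, the LRU behaviour is borrowed from java.util.LinkedHashMap in
access order, and a plain Map stands in for the term dictionary. The
cache size of 20 is the value mentioned in the description. The sketch
is single-threaded on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache: a LinkedHashMap in access order drops the
// least recently used entry once maxSize is exceeded.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruCacheSketch(int maxSize) {
        super(16, 0.75f, true); // true = iterate in access order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}

// Hypothetical stand-in for a term dictionary reader with a small cache.
// Not thread-safe; intended for use by one thread only.
class CachedDictionarySketch {
    private final Map<String, Integer> dictionary; // term -> docFreq
    private final LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(20);
    private int dictionaryLookups = 0;

    CachedDictionarySketch(Map<String, Integer> dictionary) {
        this.dictionary = dictionary;
    }

    // Called once for idf and once more when the postings are opened;
    // the second call for the same term is served from the cache.
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        dictionaryLookups++; // the expensive path
        int freq = dictionary.getOrDefault(term, 0);
        cache.put(term, freq);
        return freq;
    }

    int getDictionaryLookups() {
        return dictionaryLookups;
    }
}
```

With a three-term query, looking up each term twice in a row costs
three dictionary lookups instead of six in this sketch, which is the
saving the description is after.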

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

