lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader
Date Sat, 24 May 2008 15:17:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599606#action_12599606 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader
only puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if
e.g. a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put
into the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out
its own cache with such a query.
{quote}

Hmmm, clever, and pretty much free.

It doesn't seem like it would eliminate something like a RangeQuery adding to the cache, but
does reduce the amount of pollution.  Seems like about 1/64th of the terms would be added
to the cache?  (every 128th term and the term following that... due to "numScans > 1" check).

Still, it would take a range query covering 64K terms to completely wipe the cache, and as
long as that range query is slow relative to the term lookups, I suppose it doesn't matter
much if the cache gets wiped anyway.  A single additional hash lookup per term probably shouldn't
slow the execution of something like a range query that much either.
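
As a rough check of the figures above, the arithmetic can be sketched as follows. The interval of 128 comes from the "every 128th term" remark; the cache size of 1024 entries is an assumption, chosen only because it is the value that makes the 64K figure come out, and the class and method names are made up for illustration.

```java
// Hypothetical back-of-the-envelope helper; not part of Lucene or the patch.
final class CachePollutionEstimate {

    // Number of in-order terms a query must enumerate before it has inserted
    // cacheSize entries, if cachedPerInterval terms are cached per index interval.
    static long termsToFillCache(int cacheSize, int indexInterval, int cachedPerInterval) {
        return (long) cacheSize * indexInterval / cachedPerInterval;
    }

    public static void main(String[] args) {
        int indexInterval = 128;   // "every 128th term"
        int cachedPerInterval = 2; // that term and the one following it
        int cacheSize = 1024;      // ASSUMPTION: not stated in this thread

        // Fraction of enumerated terms that end up in the cache: 2/128 = 1/64.
        System.out.println("1/" + (indexInterval / cachedPerInterval));

        // 1024 * 128 / 2 = 65536, i.e. 64K terms to replace every entry.
        System.out.println(termsToFillCache(cacheSize, indexInterval, cachedPerInterval));
    }
}
```

With a smaller cache, such as the size of 20 used in the experiments quoted below, the same estimate gives 20 * 64 = 1280 terms.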



> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
> twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, so a significant performance improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
> cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a mid-size index of
> 500,000 documents from Wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.
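
For readers unfamiliar with the idea, a small LRU cache of the kind described in the issue can be sketched in Java on top of java.util.LinkedHashMap. This is only an illustration under assumptions: the class names SimpleLruCache and LruCacheDemo are hypothetical, and the attached patch may implement its cache quite differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the class from the attached patch.
final class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    SimpleLruCache(int maxSize) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}

final class LruCacheDemo {
    public static void main(String[] args) {
        SimpleLruCache<String, Integer> cache = new SimpleLruCache<>(2);
        cache.put("apache", 10);
        cache.put("lucene", 20);
        cache.get("apache");     // "apache" becomes the most recently used entry
        cache.put("index", 30);  // evicts "lucene", the least recently used entry
        System.out.println(cache.keySet()); // prints [apache, index]
    }
}
```

In the scenario from the issue, the first dictionary lookup (for docFreq()) would populate such a cache, and the second lookup for the same term (when the posting list is opened) would then be served from it. LinkedHashMap is not thread-safe, so a sketch like this would need one instance per thread or external synchronization.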

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

