lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Lu <>
Subject Re: Performance of never optimizing
Date Mon, 03 Nov 2008 05:58:33 GMT
Hi, Justus,

I had met with very similar problems as JIRA has, which has high 
modification and on a large data volume. It's a pretty common use case 
for Lucene.

The way I dealt with high rate of modification is to create a secondary 
in-memory index. And only persist documents older than a period of time.
So searching will need to combine results from two indexes. It's a bit 
complicated when creating the index, but it's worth well to save the 
extra IO-heavy merging and to improve response time, especially the 
ability to search right away with just added documents.

BTW: JIRA is great!

Chris Lu
Instant Scalable Full-Text Search On Any Database/Application
Lucene Database Search in 3 minutes:
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro

Justus Pendleton wrote:
> Howdy,
> I have a couple of questions regarding some Lucene benchmarking and 
> what the results mean[3]. (Skip to the numbered list at the end if you 
> don't want to read the lengthy exegesis :)
> I'm a developer for JIRA[1]. We are currently trying to get a better 
> understanding of Lucene, and our use of it, to cope with the needs of 
> our larger customers. These "large" indexes are only a couple hundred 
> thousand documents but our problem is compounded by the fact that they 
> have a relatively high rate of modification (=delete+insert of new 
> document) and our users expect these modification to show up in query 
> results pretty much instantly.
> Our current default behaviour is a merge factor of 4. We perform an 
> optimization on the index every 4000 additions. We also perform an 
> optimize at midnight. Our fundamental problem is that these 
> optimizations are locking the index for unacceptably long periods of 
> time, something that we want to resolve for our next major release, 
> hopefully without undermining search performance too badly.
> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never 
> perform optimize but should instead set their merge factor very low.
> In an attempt to understand the impact of a) lowering the merge factor 
> from 4 to 2 and b) never, ever optimizing on an index (over the course 
> of years and millions of additions/updates) I wanted to try to 
> benchmark Lucene.
> I used the contrib/benchmark framework and wrote a small algorithm 
> that adds documents to an index (using the Reuters doc generator), 
> does a search, does an optimize, then does another search. All the 
> pretty pictures can be seen at:
> I have several questions, hopefully they aren't overwhelming in their 
> quantity :-/
> 1. Why does the merge factor of 4 appear to be faster than the merge 
> factor of 2?
> 2. Why does non-optimized searching appear to be faster than optimized 
> searching once the index hits ~500,000 documents?
> 3. There appears to be a fairly sizable performance drop across the 
> board around 450,000 documents. Why is that?
> 4. Searching performance appears to decrease towards a fairly 
> pessimistic 20 searches per second (for a relatively simple search). 
> Is this really what we should expect long-term from Lucene?
> 5. Does my benchmark even make sense? I am far from an expert on 
> benchmarking so it is possible I'm not measuring what I think I am 
> measuring.
> Thanks in advance for any insight you can provide. This is an area 
> that we very much want to understand better as Lucene is a key part of 
> JIRA's success,
> Cheers,
> Justus
> JIRA Developer
> [1]:
> [2]:
> [3]:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message