lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
Date Mon, 06 Nov 2006 03:24:19 GMT
Grant Ingersoll commented on LUCENE-675:

OK, here is a first crack at a standard benchmark contribution, based on Andrzej's original contribution
and some updates/changes by me.  I wasn't nearly as ambitious as some of the comments attached
here, but I think most of them are good things to strive for and would greatly benefit Lucene.

I checked in the basic contrib directory structure, plus some library dependencies, as I wasn't
sure how svn diff handles those.  I am posting this in patch format to solicit comments first,
instead of just committing and accepting patches.  My plan is to take a round of comments,
make updates as warranted, and then make an initial commit.

I am particularly interested in the interface/Driver specification and whether people think
this approach is useful or not.  My thinking was that it might be nice to have a standard
way of creating/running benchmarks that could be driven by XML configuration files (some examples
are in the conf directory).  I am not 100% sold on this and am open to compelling arguments
for why each benchmark should just have its own main() method.
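Roughly, the shape I have in mind is something like the following sketch. To be clear, the class and method names below are purely illustrative, not the actual contents of the patch; the point is just that a driver resolves a benchmark by name from configuration, rather than each benchmark shipping its own main():

```java
import java.util.Properties;

// Illustrative sketch only -- these names are hypothetical, not the patch API.
class BenchmarkDemo {
    public static void main(String[] args) throws Exception {
        // In the real setup the Properties would come from an XML file in conf/.
        Properties config = new Properties();
        config.setProperty("doc.count", "2000");
        Driver.run("StandardBenchmarker", config);
    }
}

// Each benchmark implements a common interface instead of defining main().
interface Benchmarker {
    void runBenchmark(Properties config) throws Exception;
}

// The driver instantiates the configured benchmark reflectively and runs it,
// so new benchmarks can be added without touching the driver.
class Driver {
    static void run(String className, Properties config) throws Exception {
        Benchmarker b = (Benchmarker) Class.forName(className)
                .getDeclaredConstructor().newInstance();
        b.runBenchmark(config);
    }
}

class StandardBenchmarker implements Benchmarker {
    public void runBenchmark(Properties config) {
        System.out.println("indexing docs=" + config.getProperty("doc.count", "2000"));
    }
}
```

The trade-off versus per-benchmark main() methods is that this centralizes configuration parsing and makes benchmarks swappable from the conf files.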

As for the actual Benchmarker, I have created a "standard" version, which runs off the Reuters
collection that is downloaded automatically by the ANT task.  There are two ANT targets for
the two benchmarks: run-micro-standard and run-standard.  The micro version takes a few minutes
to run on my machine (it indexes 2000 docs); the other one takes a lot longer.

There are several support classes in the stats and util packages.  The stats package supports
building and maintaining information about benchmarks.  The util package contains one class
for extracting information out of the Reuters documents for indexing.
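As a rough illustration of the kind of thing the stats support is for (again, the class and method names here are made up for illustration, not the actual classes in the patch): accumulate per-run counts and timings, and derive a throughput figure from them.

```java
// Hypothetical sketch of a timing-stats helper; names are illustrative only.
class TimeStats {
    private long count;      // records (e.g. docs) processed so far
    private long elapsedMs;  // total elapsed time in milliseconds

    void record(long records, long ms) {
        count += records;
        elapsedMs += ms;
    }

    long getCount() { return count; }

    long getElapsedMs() { return elapsedMs; }

    // Derived throughput; guards against division by zero on an empty run.
    double recordsPerSecond() {
        return elapsedMs == 0 ? 0.0 : (count * 1000.0) / elapsedMs;
    }

    public static void main(String[] args) {
        TimeStats indexing = new TimeStats();
        indexing.record(2000, 4000);  // e.g. 2000 docs indexed in 4 seconds (made-up numbers)
        System.out.println(indexing.recordsPerSecond() + " docs/sec");  // prints: 500.0 docs/sec
    }
}
```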

The ReutersQueries class contains a set of queries I created by looking at some of the docs
in the collection; it covers a myriad of term, phrase, span, wildcard, and other query types.
 They aren't exhaustive by any means.

It should be stressed that these benchmarks are best used for gathering before-and-after numbers.
 Furthermore, they aren't the be-all and end-all of benchmarking for Lucene.  I hope the interface-based
design will encourage others to submit benchmarks for specific areas of Lucene not covered
by this version.
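To make the before-and-after point concrete: the useful output is the relative change between two runs of the same benchmark on the same machine, not the absolute numbers. A toy comparison (all numbers made up):

```java
// Toy before/after comparison; the timings here are invented for illustration.
class BeforeAfter {
    // Percent change from a "before" measurement to an "after" measurement.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        double beforeMs = 4000.0;  // hypothetical run time before a code change
        double afterMs  = 3200.0;  // same benchmark, same machine, after the change
        System.out.printf("%.1f%%%n", percentChange(beforeMs, afterMs));  // prints: -20.0%
    }
}
```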

Thanks to all who contributed their code/thoughts.  Patch to follow.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>                 Key: LUCENE-675
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments:, extract_reuters.plx,,
> We need an objective way to measure the performance of Lucene, both indexing and querying,
> on a known corpus. This issue is intended to collect comments and patches implementing a suite
> of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original
> Reuters collection, available from
> or I
> propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically
> retrieve it from known locations, and cache it locally.
