lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Updated: (LUCENE-790) contrib/benchmark - few improvements and a bug fix
Date Thu, 01 Feb 2007 08:05:10 GMT


Doron Cohen updated LUCENE-790:

    Attachment: TrecDocMaker.patch

The attached TrecDocMaker.patch also contains the changes from the current patch in 788, because both
patches modify ReutersDocMaker, so it is sufficient to apply this patch only. I will add
a comment about that in 788. Once this is committed, I will mark 788 as a duplicate of this issue.

Some TODO items are listed in the byTask javadocs; comments are welcome.

> contrib/benchmark - few improvements and a bug fix
> --------------------------------------------------
>                 Key: LUCENE-790
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>    Affects Versions: 2.1
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>             Fix For: 2.1
>         Attachments: TrecDocMaker.patch
> Benchmark byTask was slightly improved:
> 1. Fixed a bug in the "child-should-not-report" mechanism. If a task sequence contained
only simple tasks, it worked as expected (i.e. child tasks did not report times/memory). But
if a child was itself a task sequence, its children would report when they should not. This is
now fixed, so the property is "penetrating/inherited" all the way down.
> 2. Doc size control is now also possible for the Reuters doc maker (allowing indexing of N
docs of C characters each).
> 3. TrecDocMaker was added. It reads as input the .gz files used in TREC, e.g. the .gov
data, which can be handy for benchmarking Lucene on these large collections. As with the Reuters
collection, the doc maker scans the input directory for all files and extracts documents
from them. Here, each input file contains multiple documents. Unlike the Reuters
collection, we cannot provide a 'loader' for these collections - they are available from
- for research purposes.
> 4. A new BasicDocMaker abstract class handles most doc-maker tasks, including creating
docs of a specific size, so adding new doc makers for other data is now much simpler.
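
To illustrate item 1 above, here is a minimal Java sketch of a "children should not report" setting that is inherited through nested task sequences. It is not taken from the patch: every class and method name in it (Task, SimpleTask, TaskSequence, silenceChildren, collectReporting, NoChildReportSketch) is hypothetical and chosen only for illustration. The real byTask classes may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a benchmark task; not the byTask API.
abstract class Task {
    private final String name;
    private boolean reporting = true;

    Task(String name) { this.name = name; }

    boolean isReporting() { return reporting; }

    // Stop this task from reporting times/memory.
    void silence() { reporting = false; }

    // Append the names of all tasks that would still report.
    void collectReporting(List<String> out) {
        if (reporting) out.add(name);
    }
}

class SimpleTask extends Task {
    SimpleTask(String name) { super(name); }
}

class TaskSequence extends Task {
    private final List<Task> children = new ArrayList<>();
    private boolean childrenSilenced = false;

    TaskSequence(String name) { super(name); }

    void add(Task child) {
        children.add(child);
        // Children added later inherit the setting as well.
        if (childrenSilenced) child.silence();
    }

    // Silence all descendants; the sequence itself keeps reporting.
    void silenceChildren() {
        childrenSilenced = true;
        for (Task child : children) child.silence();
    }

    @Override
    void silence() {
        super.silence();
        // The key step: a silenced sequence also silences its own
        // children, so the setting reaches every nesting level.
        silenceChildren();
    }

    @Override
    void collectReporting(List<String> out) {
        super.collectReporting(out);
        for (Task child : children) child.collectReporting(out);
    }
}

class NoChildReportSketch {
    static List<String> demo() {
        TaskSequence inner = new TaskSequence("inner");
        inner.add(new SimpleTask("AddDoc"));

        TaskSequence outer = new TaskSequence("outer");
        outer.add(new SimpleTask("OpenWriter"));
        outer.add(inner);
        outer.silenceChildren();

        List<String> reporting = new ArrayList<>();
        outer.collectReporting(reporting);
        return reporting;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [outer]
    }
}
```

Without the override of silence() in TaskSequence, "AddDoc" would still report in this sketch, which mirrors the behavior described as the bug.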

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
