lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Resolved: (LUCENE-211) [Patch] replace DocumentWriter with InvertedDocument for performance
Date Sun, 30 Dec 2007 18:56:43 GMT


Michael Busch resolved LUCENE-211.

    Resolution: Duplicate
      Assignee:     (was: Lucene Developers)

This is a very similar idea to LUCENE-843, which is already committed.

> [Patch] replace DocumentWriter with InvertedDocument for performance
> --------------------------------------------------------------------
>                 Key: LUCENE-211
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: unspecified
>         Environment: Operating System: All
> Platform: All
>            Reporter: Brian Slesinsky
>            Priority: Minor
>         Attachments: inverted-doc.patch
> I've found a way to improve Lucene's indexing performance by about 45% for my dataset.
> Here's how it works: currently the indexing process goes like this:
> - use DocumentWriter to create an inverted index and serialize a one-document segment
>   to a RAMDirectory
> - when enough documents have been read, deserialize the one-document segments in the
>   RAMDirectory and merge them, writing the merged segment to disk.
> What I've done instead is create a new class, InvertedDocument, that keeps the inverted
> index in a Map, and can also be used directly as input for a merge.  This avoids the
> serialization/deserialization, and the RAMDirectory is no longer used when indexing.
> The patch applies to the contents of CVS as of today (April 3).  (It's a big patch and
> includes some minor style tweaks that aren't directly related.)
> I did the performance testing using a simple application that creates an index from a
> file containing messages extracted from a bulletin board.  It could index about
> 100 kilobytes/second with Lucene 1.3, and 145 kilobytes/second with the patch.  This is
> on a 700 MHz eMac, which is pretty slow at Java, and the documents being indexed are,
> on average, less than a screenful.
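
The approach described in the quoted issue (keep each document's inverted index in a Map and feed it straight into the merge) can be illustrated roughly as follows. This is only a minimal sketch in modern Java, not the code from inverted-doc.patch: the class name InvertedDocumentSketch, the nested InvertedDocument class, the merge method, and the whitespace/punctuation tokenizer are all assumptions made for illustration, and the "segment" here is a plain in-memory map rather than Lucene's on-disk format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not the classes from inverted-doc.patch. */
public class InvertedDocumentSketch {

    /** In-memory inverted index for a single document: term -> token positions. */
    static final class InvertedDocument {
        final int docId;
        final Map<String, List<Integer>> positions = new TreeMap<>();

        InvertedDocument(int docId, String text) {
            this.docId = docId;
            int pos = 0;
            // Assumed tokenizer: lowercase, split on runs of non-word characters.
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                positions.computeIfAbsent(token, t -> new ArrayList<>()).add(pos++);
            }
        }
    }

    /**
     * Merges buffered documents directly into term -> (docId -> positions),
     * with no intermediate serialize/deserialize step.
     */
    static Map<String, Map<Integer, List<Integer>>> merge(List<InvertedDocument> docs) {
        Map<String, Map<Integer, List<Integer>>> segment = new TreeMap<>();
        for (InvertedDocument doc : docs) {
            for (Map.Entry<String, List<Integer>> e : doc.positions.entrySet()) {
                segment.computeIfAbsent(e.getKey(), t -> new TreeMap<>())
                       .put(doc.docId, e.getValue());
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        List<InvertedDocument> buffer = new ArrayList<>();
        buffer.add(new InvertedDocument(0, "Lucene indexes documents"));
        buffer.add(new InvertedDocument(1, "documents are merged into segments"));
        // Postings for one term: docId -> positions within that document.
        System.out.println(merge(buffer).get("documents"));
    }
}
```

The point of the sketch is the shape of the data flow: each buffered document already holds its postings in sorted-map form, so the merge step only walks the maps, which is the saving the reporter attributes to skipping the RAMDirectory round trip.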

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.