lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Tue, 03 Apr 2007 12:21:32 GMT


Michael McCandless commented on LUCENE-843:

A few notes from these results:

  * A real Lucene app won't see these gains because frequently the
    retrieval of docs from the content source, and the tokenization,
    take substantial amounts of time, whereas for this test I've
    intentionally minimized the cost of those steps: I'm 1) pulling
    one line at a time from a big text file, and 2) using my
    simplistic SimpleSpaceAnalyzer, which just breaks tokens at the
    space character.

  * Best speedup is ~4.3X faster, for tiny docs (~550 bytes) with term
    vectors and stored fields enabled and using autoCommit=false.

  * Least speedup is still ~1.6X faster, for large docs (~55,000
    bytes) with autoCommit=true.

  * The autoCommit=false cases are a little unfair to the new patch
    because with the new patch, you get a single-segment (optimized)
    index in the end, but with existing Lucene trunk, you don't.

  * With term vectors and/or stored fields, autoCommit=false is quite
    a bit faster with the patch, because we never pay the price to
    merge them since they are written once.

  * With term vectors and/or stored fields, the new patch has
    substantially better RAM efficiency.

  * The patch's gains in both speed and RAM efficiency are largest
    with smaller documents.

  * The actual HEAP RAM usage is quite a bit more stable with the
    patch, especially with term vectors & stored fields enabled.  I
    think this is because the patch creates far less garbage for GC to
    periodically reclaim.  I think this also means you could push your
    RAM buffer size even higher to get better performance.
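
To make the first note concrete: an analyzer that only splits at the
space character does almost no work per token, which is why
tokenization cost is negligible in this test. A minimal sketch in
plain Java follows. The class and method names are hypothetical; it
does not use Lucene's Analyzer API and is not the SimpleSpaceAnalyzer
used for the benchmark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SimpleSpaceAnalyzer from the test.
final class SpaceTokenizerSketch {

    /** Splits text at the ' ' character only and skips empty tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0; // index where the current token begins
        for (int i = 0; i <= text.length(); i++) {
            // A token ends at a space or at the end of the input.
            if (i == text.length() || text.charAt(i) == ' ') {
                if (i > start) {
                    tokens.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        return tokens;
    }
}
```

A production analyzer typically also lowercases, strips punctuation,
and filters stop words, so a real app spends proportionally more time
in analysis than this test does.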

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>                 Key: LUCENE-843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch,
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
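
The difference between flushing by RAM usage and flushing after a
fixed number of documents can be sketched as below. Everything in it
is an assumption made for illustration: the class RamFlushSketch, the
byte accounting (document length stands in for real memory usage), and
the flush() body. It is not the MultiDocumentWriter from the patch and
does not call any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush when buffered bytes reach a budget,
// rather than after a fixed number of added documents.
final class RamFlushSketch {
    private final long ramBudgetBytes;
    private final List<byte[]> buffered = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    RamFlushSketch(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    void addDocument(byte[] doc) {
        buffered.add(doc);
        bytesUsed += doc.length; // crude stand-in for real RAM accounting
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        // A real writer would write the buffered postings to a segment here.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    long bytesUsed() { return bytesUsed; }
}
```

With a byte budget, many small documents are buffered per flush while
a few large ones trigger a flush early, so RAM usage stays bounded
regardless of document size. A fixed document count gives no such
bound.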

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
