lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Updated: (LUCENE-1301) Refactor DocumentsWriter
Date Tue, 10 Jun 2008 10:18:45 GMT


Michael McCandless updated LUCENE-1301:

    Attachment: LUCENE-1301.patch

> Refactor DocumentsWriter
> ------------------------
>                 Key: LUCENE-1301
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>         Attachments: LUCENE-1301.patch
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> alot more to do!).
> And it's very much still a work in progress -- there are intemittant
> thread safety issues, I need to add tests cases and test/iterate on
> performance, many "nocommits", etc.  This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing.  EG DocConsumer
> consumes the whole document.  DocFieldConsumer consumes separate
> fields, one at a time.  InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer.  TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing.  Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>   * NormsWriter holds norms in memory and then flushes them to _X.nrm.
>   * FreqProxTermsWriter holds postings data in memory and then flushes
>     to _X.frq/prx.
>   * StoredFieldsWriter flushes immediately to _X.fdx/fdt
>   * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necesary, etc.
> In this first step, everything is package-private, and, the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk.  Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>   * Improved concurrency with mixed large/small docs: previously the
>     thread state would be tied up when docs finished indexing
>     out-of-order.  Now, it's not: instead I use a separate class to
>     hold any pending state to flush to the doc stores, and immediately
>     free up the thread state to index other docs.
>   * Buffered norms in memory now remain sparse, until flushed to the
>     _X.nrm file.  Previously we would "fill holes" in norms in memory,
>     as we go, which could easily use way too much memory.  Really this
>     isn't a solution to the problem of sparse norms (LUCENE-830); it
>     just delays that issue from causing memory blowup during indexing;
>     memory use will still blowup during searching.
> I expect performance (indexing throughput) will be worse with this
> change.  I'll profile & iterate to minimize this, but I think we can
> accept some loss.  I also plan to measure benefit of manually
> re-cycling RawPostingList instances from our own pool, vs letting GC
> recycle them.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message