lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Tue, 03 Apr 2007 16:51:22 GMT

"Ning Li" <> wrote:
> On 4/3/07, Michael McCandless (JIRA) <> wrote:
> >  * With term vectors and/or stored fields, the new patch has
> >    substantially better RAM efficiency.
> Impressive numbers! The new patch improves RAM efficiency quite a bit
> even with no term vectors nor stored fields, because of the periodic
> in-RAM merges of posting lists & term infos etc. The frequency of the
> in-RAM merges is controlled by flushedMergeFactor, which measures in
> doc count, right? How sensitive is performance to the value of
> flushedMergeFactor?

Right, the in-RAM merges seem to help *a lot* because you get great
compression of the terms dictionary, and also some compression of the
freq postings since the docIDs are delta encoded.  Also, you waste
less space at the end of buffers (buffers are a fixed size) when you
merge together into a large segment.
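
To illustrate why delta encoding shrinks the postings once docIDs from
many small segments sit in one list, here is a minimal sketch.  It is
not code from the patch: the class and method names are made up for
illustration, and the variable-length int encoding (7 bits per byte,
high bit as continuation flag) is assumed to be in the spirit of
Lucene's VInt.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class, not part of the LUCENE-843 patch.
class DeltaVIntDemo {
    // Write a non-negative int using 7 bits per byte; the high bit of
    // each byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode ascending docIDs either as absolute values or as gaps
    // (deltas) from the previous docID.
    static byte[] encode(int[] sortedDocIds, boolean useDeltas) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, useDeltas ? docId - previous : docId);
            previous = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docIds = {100000, 100003, 100010, 100500};
        System.out.println("absolute: " + encode(docIds, false).length + " bytes"); // 12
        System.out.println("delta:    " + encode(docIds, true).length + " bytes");  // 7
    }
}
```

Small gaps fit in one byte each, so the saving grows as more docIDs
for the same term end up in the same merged postings list.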

The in-RAM merges are triggered by number of bytes used vs RAM buffer
size.  Each doc is indexed to its own RAM segment, then once these
level 0 segments use > 1/Nth of the RAM buffer size, I merge into
level 1.  Then once level 1 segments are using > 1/Mth of the RAM
buffer size, I merge into level 2.  I don't do any merges beyond that.
Right now N = 14 and M = 7 but I haven't really tuned them yet ...
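
The trigger logic above can be sketched roughly as follows.  This is
only an illustration under stated assumptions, not the patch itself:
the class, field and method names are hypothetical, segments are
modelled as bare byte counts, and a merged segment is assumed to be as
large as the sum of its inputs (a real merge compresses, so it would
be smaller).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two-level in-RAM merge trigger.
class RamMergeSketch {
    static final int N = 14; // level 0 merges once it uses > 1/N of the buffer
    static final int M = 7;  // level 1 merges once it uses > 1/M of the buffer

    final long ramBufferBytes;
    final List<Long> level0 = new ArrayList<>(); // one entry per single-doc segment
    final List<Long> level1 = new ArrayList<>();
    final List<Long> level2 = new ArrayList<>(); // never merged further in RAM

    RamMergeSketch(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    static long sum(List<Long> sizes) {
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        return total;
    }

    // Assumption: merged size equals the sum of the inputs.
    static long mergeAll(List<Long> from) {
        long total = sum(from);
        from.clear();
        return total;
    }

    // Each added doc becomes its own level 0 segment, then the
    // byte-based thresholds are checked.
    void addDocument(long segmentBytes) {
        level0.add(segmentBytes);
        if (sum(level0) > ramBufferBytes / N) {
            level1.add(mergeAll(level0));
        }
        if (sum(level1) > ramBufferBytes / M) {
            level2.add(mergeAll(level1));
        }
    }

    // True once the segments together fill the RAM buffer.
    boolean isFull() {
        return sum(level0) + sum(level1) + sum(level2) >= ramBufferBytes;
    }
}
```

For example, with a 1400 byte buffer and 30 byte docs, the 4th doc
pushes level 0 past 100 bytes and triggers a merge into level 1, and
the 8th doc pushes level 1 past 200 bytes and triggers a merge into
level 2.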

Once RAM is full, all of those segments are merged into a single
on-disk segment.  Once enough on-disk segments accumulate they are
periodically merged (based on flushedMergeFactor) as well.  Finally
when it's time to commit a real segment I merge all RAM segments and
flushed segments into a real Lucene segment.
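
A rough sketch of that flush/commit lifecycle, again with made-up
names and not taken from the patch.  One loud assumption: here
flushedMergeFactor is treated as the number of flushed segments that
triggers a merge, which may not match how the patch actually
interprets it.  Segments are modelled as doc counts only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of flushing RAM segments and committing.
class FlushSketch {
    // Assumption: merge flushed segments once this many have accumulated.
    final int flushedMergeFactor;
    final List<Integer> flushed = new ArrayList<>(); // doc counts of flushed segments
    int ramDocs = 0; // docs currently held in RAM segments

    FlushSketch(int flushedMergeFactor) {
        this.flushedMergeFactor = flushedMergeFactor;
    }

    void bufferDoc() {
        ramDocs++;
    }

    // RAM is full: all RAM segments become a single flushed segment.
    void flushRam() {
        if (ramDocs == 0) {
            return;
        }
        flushed.add(ramDocs);
        ramDocs = 0;
        if (flushed.size() >= flushedMergeFactor) {
            mergeFlushed();
        }
    }

    // Merge all flushed segments into one flushed segment.
    void mergeFlushed() {
        int total = 0;
        for (int docs : flushed) {
            total += docs;
        }
        flushed.clear();
        flushed.add(total);
    }

    // Commit: RAM segments plus flushed segments become one real
    // segment; returns its doc count.
    int commit() {
        int total = ramDocs;
        for (int docs : flushed) {
            total += docs;
        }
        ramDocs = 0;
        flushed.clear();
        return total;
    }
}
```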

I haven't done much testing to find the sweet spot for these merge
settings just yet.  Still plenty to do!

