lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date Wed, 06 Sep 2006 18:35:59 GMT

On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:

> So it looks like you have intermediate things that aren't lucene
> segments, but end up producing valid lucene segments at the end of a
> session?

That's one way of thinking about it.  There's only one "thing"  
though: a big bucket of serialized index entries.  At the end of a  
session, those are sorted, pulled apart, and used to write the tis,  
tii, frq, and prx files.
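
To make the "one big bucket" idea concrete, here is a minimal Java sketch. It is an illustration only, not KinoSearch or Lucene code: the class and method names (PostingBucketSketch, addDocument, finish) are hypothetical, the tokenizer is a plain whitespace split, and entries are kept as in-memory objects rather than serialized and sorted externally. Adding a document only appends entries. The single sort at the end of the session puts them in term, doc, position order, so one sequential pass can pull them apart into per-term postings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from KinoSearch or Lucene.
class PostingBucketSketch {
    // One index entry: a term, the document it occurred in, and its position.
    static final class Entry implements Comparable<Entry> {
        final String term;
        final int doc;
        final int pos;

        Entry(String term, int doc, int pos) {
            this.term = term;
            this.doc = doc;
            this.pos = pos;
        }

        // Order by term, then document, then position.
        @Override
        public int compareTo(Entry o) {
            int c = term.compareTo(o.term);
            if (c == 0) c = Integer.compare(doc, o.doc);
            if (c == 0) c = Integer.compare(pos, o.pos);
            return c;
        }
    }

    private final List<Entry> bucket = new ArrayList<>();
    private int nextDoc = 0;

    // Adding a document only appends entries; nothing is sorted or merged yet.
    int addDocument(String text) {
        int doc = nextDoc++;
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!tokens[pos].isEmpty()) {
                bucket.add(new Entry(tokens[pos], doc, pos));
            }
        }
        return doc;
    }

    // End of session: sort once, then walk the sorted entries sequentially.
    // The insertion-ordered maps stand in for the term dictionary and the
    // postings files, which a real writer would stream to disk.
    Map<String, Map<Integer, List<Integer>>> finish() {
        Collections.sort(bucket);
        Map<String, Map<Integer, List<Integer>>> postings = new LinkedHashMap<>();
        for (Entry e : bucket) {
            postings.computeIfAbsent(e.term, t -> new LinkedHashMap<>())
                    .computeIfAbsent(e.doc, d -> new ArrayList<>())
                    .add(e.pos);
        }
        return postings;
    }
}
```

Because the entries arrive at finish() already in final order, no per-document segment is ever built and nothing has to be merged afterwards.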

Everything else (e.g. stored fields) gets written incrementally as
documents get added.  The fact that stored fields don't get shuffled
around is one of this algorithm's advantages (along with much lower
memory requirements, etc.).

> For Java lucene, I think the biggest indexing gain could be had by not
> buffering using single doc segments, but something optimized for
> in-memory single segment creation.

In theory, you could apply this technique only to a limited number of  
docs and create segments, say, 10 docs at a time rather than 1 at a  
time.  But then you still have to do something with each 10 doc  
segment, and you don't get the benefits of less disk shuffling and  
lower RAM usage.  Better to just create 1 segment per session.

> The downside is complexity... two
> sets of "merge" code.

KS doesn't have SegmentMerger.  :)

> It would be interesting to see an IndexWriter2 for full Gordian Knot
> cutting like you do :-)

I've already contributed a Java port of KinoSearch's external sorter  
(along with its tests), which is the crucial piece.  The rest isn't  
easy, but stay tuned.  ;)
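
For readers unfamiliar with the term, an external sorter in general works roughly as in the Java sketch below. This is a generic external merge sort written for illustration, not the contributed port: the class name ExternalSortSketch and its methods are hypothetical, entries are assumed to be newline-free strings, and the merged result is returned as a list only to keep the example short (a real sorter would hand entries out one at a time).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a generic external merge sort.
class ExternalSortSketch {
    private final int maxInMemory;                      // entries buffered before a spill
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> runs = new ArrayList<>();  // sorted runs on disk

    ExternalSortSketch(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    // Feed one entry; spill a sorted run to a temp file when the buffer fills.
    void feed(String entry) {
        buffer.add(entry);
        if (buffer.size() >= maxInMemory) flushRun();
    }

    private void flushRun() {
        if (buffer.isEmpty()) return;
        Collections.sort(buffer);
        try {
            Path run = Files.createTempFile("sort-run", ".txt");
            Files.write(run, buffer, StandardCharsets.UTF_8);
            runs.add(run);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
    }

    // One cursor per run: the current line plus the reader it came from.
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader in;
        String line;

        Cursor(BufferedReader in) throws IOException {
            this.in = in;
            this.line = in.readLine();
        }

        @Override
        public int compareTo(Cursor o) {
            return line.compareTo(o.line);
        }
    }

    // Spill whatever is left, then k-way merge all runs.
    List<String> sortedEntries() {
        flushRun();
        try {
            return mergeRuns();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private List<String> mergeRuns() throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (Path run : runs) {
            Cursor c = new Cursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
            if (c.line != null) heap.add(c); else c.in.close();
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();          // smallest current line across all runs
            out.add(c.line);
            c.line = c.in.readLine();        // advance that run
            if (c.line != null) heap.add(c); else c.in.close();
        }
        for (Path run : runs) Files.deleteIfExists(run);
        runs.clear();
        return out;
    }
}
```

Only maxInMemory entries plus one line per run are held in RAM at any time during feeding and merging, which is where the lower memory requirement of this kind of sorter comes from.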

Marvin Humphrey
Rectangular Research
