lucene-dev mailing list archives

From jason rutherglen <>
Subject Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date Wed, 06 Sep 2006 19:08:54 GMT
Sounds interesting, Marvin. I would be willing to test whatever you create.  I am working on
a rapidly updating index, and it sounds like this may help.  I've noticed that, even on a
ramdisk, the whole merging process is quite slow.  The CPU is not maxed out either, perhaps
because of the locking that occurs.  There seems to be a lot of room for optimization.
Cheers.

----- Original Message ----
From: Marvin Humphrey <>
Sent: Wednesday, September 6, 2006 11:35:59 AM
Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:

> So it looks like you have intermediate things that aren't lucene
> segments, but end up producing valid lucene segments at the end of a
> session?

That's one way of thinking about it.  There's only one "thing"  
though: a big bucket of serialized index entries.  At the end of a  
session, those are sorted, pulled apart, and used to write the tis,  
tii, frq, and prx files.
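
As a rough illustration of the "one big bucket" idea described above, here is a minimal in-memory sketch. It is not KinoSearch or Lucene code: the class name PostingBucketSketch, the Posting type, and the whitespace tokenizer are all hypothetical, and the final step builds a term-to-documents map where a real indexer would write the tis, tii, frq, and prx files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every name here is invented for illustration.
public class PostingBucketSketch {
    /** One index entry: a term occurrence at a position in a document. */
    static final class Posting implements Comparable<Posting> {
        final String term;
        final int docId;
        final int position;

        Posting(String term, int docId, int position) {
            this.term = term;
            this.docId = docId;
            this.position = position;
        }

        // Order by term, then document, then position.
        @Override
        public int compareTo(Posting other) {
            int c = term.compareTo(other.term);
            if (c != 0) return c;
            c = Integer.compare(docId, other.docId);
            if (c != 0) return c;
            return Integer.compare(position, other.position);
        }
    }

    private final List<Posting> bucket = new ArrayList<>();
    private int nextDocId = 0;

    /** Adds a document: one entry per whitespace-separated token goes into the bucket. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!tokens[pos].isEmpty()) {
                bucket.add(new Posting(tokens[pos], docId, pos));
            }
        }
        return docId;
    }

    /** End of session: sort the bucket, then pull it apart into per-term document lists. */
    public Map<String, List<Integer>> finish() {
        Collections.sort(bucket);
        Map<String, List<Integer>> termToDocs = new LinkedHashMap<>();
        for (Posting p : bucket) {
            List<Integer> docs = termToDocs.computeIfAbsent(p.term, t -> new ArrayList<>());
            // Entries arrive sorted, so a repeated doc ID is always the last one added.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != p.docId) {
                docs.add(p.docId);
            }
        }
        return termToDocs;
    }
}
```

Nothing is sorted or merged while documents are being added; the only ordering work happens once, in finish(). The sketch keeps the whole bucket in RAM, which a real implementation would presumably avoid by spilling the entries to disk and sorting them externally.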

Everything else (e.g. stored fields) gets written incrementally as  
documents get added.  The fact that stored fields don't get shuffled  
around is one of this algorithm's advantages (along with much lower  
memory requirements, etc).

> For Java lucene, I think the biggest indexing gain could be had by not
> buffering using single doc segments, but something optimized for
> in-memory single segment creation.

In theory, you could apply this technique only to a limited number of  
docs and create segments, say, 10 docs at a time rather than 1 at a  
time.  But then you still have to do something with each 10 doc  
segment, and you don't get the benefits of less disk shuffling and  
lower RAM usage.  Better to just create 1 segment per session.

> The downside is complexity... two
> sets of "merge" code.

KS doesn't have SegmentMerger.  :)

> It would be interesting to see an IndexWriter2 for full Gordian Knot
> cutting like you do :-)

I've already contributed a Java port of KinoSearch's external sorter  
(along with its tests), which is the crucial piece.  The rest isn't  
easy, but stay tuned.  ;)
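
For readers unfamiliar with external sorting, the general technique can be sketched as follows. This is a generic textbook-style sketch, not the contributed port of KinoSearch's sorter: the class ExternalSortSketch, its sort method, and the maxInMemory parameter are hypothetical, and entries are assumed to be single-line strings so they can be stored one per line in temporary run files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a generic external merge sort; names are invented.
public class ExternalSortSketch {
    /** The current head line of one sorted run, used as a heap entry during the merge. */
    private static final class RunHead implements Comparable<RunHead> {
        final String line;
        final BufferedReader reader;

        RunHead(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }

        @Override
        public int compareTo(RunHead other) {
            return line.compareTo(other.line);
        }
    }

    /** Sorts entries while holding at most maxInMemory of them in the buffer at once. */
    public static List<String> sort(Iterator<String> entries, int maxInMemory) throws IOException {
        if (maxInMemory < 1) throw new IllegalArgumentException("maxInMemory must be >= 1");
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        while (entries.hasNext()) {
            buffer.add(entries.next());
            if (buffer.size() >= maxInMemory) {
                runs.add(spill(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            runs.add(spill(buffer));
            buffer.clear();
        }
        return merge(runs);
    }

    /** Sorts the buffer in memory and writes it to a temporary file as one sorted run. */
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buffer, StandardCharsets.UTF_8);
        return run;
    }

    /** K-way merge: repeatedly take the smallest head line across all runs. */
    private static List<String> merge(List<Path> runs) throws IOException {
        PriorityQueue<RunHead> heap = new PriorityQueue<>();
        for (Path run : runs) {
            advance(heap, Files.newBufferedReader(run, StandardCharsets.UTF_8));
        }
        // Collected in a list for simplicity; a real sorter would stream this output.
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            RunHead head = heap.poll();
            out.add(head.line);
            advance(heap, head.reader);
        }
        for (Path run : runs) {
            Files.deleteIfExists(run);
        }
        return out;
    }

    /** Reads the next line of a run into the heap, or closes the run when it is exhausted. */
    private static void advance(PriorityQueue<RunHead> heap, BufferedReader reader) throws IOException {
        String line = reader.readLine();
        if (line == null) {
            reader.close();
        } else {
            heap.add(new RunHead(line, reader));
        }
    }
}
```

For example, calling ExternalSortSketch.sort(entries.iterator(), 2) on five entries writes three sorted runs to temporary files and merges them in one pass. Error handling is deliberately thin: if an I/O error occurs mid-merge, open readers and temporary files are not cleaned up.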

Marvin Humphrey
Rectangular Research
