lucene-dev mailing list archives

From "Kevin Oliver" <>
Subject RE: Faking index merge by modifying segments file?
Date Tue, 01 Nov 2005 17:17:45 GMT
Hello Otis,

I worked on a similar issue a couple of months ago. I've included our
email conversation below.

Hopefully, your thread will prompt more interest from the mailing list. 


Sort of -- but only in a very controlled situation, and with some
hackery, can you comment out both of them.

Here's what I had to do -- in pseudo code.

Create a new IndexWriter subclass, call it IndexWriter2, whose
segmentCounter is initialized to the real (actual pre-existing) index's
segmentCounter - 1. I suspect this part is not very robust, as I
don't completely understand why I needed to subtract 1 (it has something
to do with the temporary RAMDirectory that gets used before segments are
actually written to disk).

Add the documents into our IndexWriter2 so that they get properly
written to a separate place on the file system and get the correct
"next" segment names that would appear in the real index.

Within the addIndexes() loop over the dirs, you move all the newly
created files from their current location over to the real index's file
directory. This part in particular feels very hacked.

Finally, instead of calling optimize() at the end of addIndexes(), you
rewrite the segments file so that it includes all these new segments. 
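
To make the "correct next segment names" part concrete, here is a minimal
sketch. It assumes the naming convention is an underscore followed by the
segment counter written in base 36; SegmentNameSketch and its methods are
hypothetical helpers for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class SegmentNameSketch {
    // Assumed naming scheme: "_" followed by the counter in base 36.
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    // Names the next `count` segments would receive if numbering
    // starts at `counter`.
    static List<String> nextSegmentNames(int counter, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(segmentName(counter + i));
        }
        return names;
    }

    public static void main(String[] args) {
        // A writer whose counter starts at 34 would produce these names.
        System.out.println(nextSegmentNames(34, 3)); // prints [_y, _z, _10]
    }
}
```

If the separate writer starts from the wrong counter, the names it produces
can collide with segments already in the real index, which is why the
counter has to be seeded from the real index.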

One other note is that if I weren't using compound files, then I _think_
I could just rename all of the files when they get moved into the
real index's file directory. But compound files create their internal
files using the segmentName that the segment was created with, thus
creating a mismatch when you rename it externally.
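
The move step, with file names left unchanged, could look roughly like the
sketch below. It uses only java.nio.file rather than any Lucene API.
SegmentFileMover is a hypothetical name, and the rule that segment files
start with "_" is an assumption used here to skip the separate index's own
segments file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class SegmentFileMover {
    // Moves every file whose name starts with "_" from `staging` into
    // `indexDir` without renaming it. Returns the number of files moved.
    static int moveAll(Path staging, Path indexDir) throws IOException {
        // Collect first so the directory is not modified while it is listed.
        List<Path> toMove = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(staging)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Assumption: segment files start with "_"; anything else
                // (such as the staging index's segments file) stays behind.
                if (Files.isRegularFile(file) && name.startsWith("_")) {
                    toMove.add(file);
                }
            }
        }
        for (Path file : toMove) {
            // No REPLACE_EXISTING option: a name collision with the real
            // index raises FileAlreadyExistsException instead of
            // overwriting an existing segment file.
            Files.move(file, indexDir.resolve(file.getFileName()));
        }
        return toMove.size();
    }
}
```

Failing on a collision is deliberate: a clash means the segment names were
not generated from the real index's counter, and overwriting would corrupt
the real index.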


-----Original Message-----
From: Otis Gospodnetic [] 
Sent: Thursday, August 11, 2005 3:36 PM
Subject: Re: Avoiding segment merges during indexing

Kevin - are you saying that you can just comment out the 2 optimize()
calls and addIndexes(Directory[]) will keep working?  I don't recall
why there are optimize() calls again, but I know several people had
issues with it...


--- Kevin Oliver <> wrote:

> This is a proposal that is in need of some insights.
> In an effort to speed up adding documents to an existing index, we are
> pursuing using IndexWriter.addIndexes(Directory[]). In theory this
> should work great -- you index your new documents into a new Directory,
> then add them into your existing directory, saving you the time spent
> merging segments that would be caused by the normal
> IndexWriter.addDocument(Document) calls during indexing.
> However, addIndexes() has the property that it calls optimize() both
> before and after adding the new directories. This wipes out the
> performance boost, and then some.
> So I found a way to work around this, but I don't like what I've had to
> do and I was wondering if anybody has any ideas on what could be done to
> make this more pleasant.
> It appears that by getting the new segment files into the existing
> directory, with the correct segment names, it will work without all of
> the optimize calls. Unfortunately, getting the segment names right and
> getting the files into the right location is a big ugly hack and is
> quite fragile.
> Is there a better way? I think maybe some explanation into why the 2
> optimizes are there would help my understanding. Is there a clean way of
> doing what I'm proposing? Is there some hidden catch I'm missing and
> I've been going down the wrong path?
> It seems to me this would be a great benefit to anyone who does indexing
> on existing indexes and wants it to be fast.
> Thanks,
> Kevin Oliver

-----Original Message-----
From: Otis Gospodnetic [] 
Sent: Monday, October 31, 2005 11:52 PM
Subject: Faking index merge by modifying segments file?


I spent most of today talking to some people about Lucene, and one of
them said he would really like to have an "instantaneous index
merge", and that he is thinking he could achieve that by simply opening
the segments file of one index and adding the segment names of the other
index/indices, plus adjusting the segment size (SegSize in
fileformats.html), thus creating a single (but unoptimized) index.

Any reactions to that?

I imagine this isn't quite that simple to implement, as one would have
to renumber all documents, in order to avoid having multiple documents
with the same document id.

Can anyone think of any other problems with this approach, or perhaps
offer ideas for possible document renumbering?
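
One way to picture the renumbering concern is to treat each segment's
document numbers as local and derive an index-wide number by adding a
per-segment base offset, the running sum of the sizes of the segments
listed before it. The sketch below is only an illustration under that
assumption; DocBaseSketch and its methods are hypothetical and not Lucene
code.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
class DocBaseSketch {
    // Base offset of each segment: the total number of documents in all
    // segments listed before it.
    static int[] docBases(int[] segSizes) {
        int[] bases = new int[segSizes.length];
        int next = 0;
        for (int i = 0; i < segSizes.length; i++) {
            bases[i] = next;
            next += segSizes[i];
        }
        return bases;
    }

    // Index-wide document number for a document local to one segment.
    static int globalDocId(int[] bases, int segment, int localDocId) {
        return bases[segment] + localDocId;
    }

    public static void main(String[] args) {
        int[] sizes = {100, 40, 7};  // documents per segment, in list order
        int[] bases = docBases(sizes);
        System.out.println(Arrays.toString(bases));    // prints [0, 100, 140]
        System.out.println(globalDocId(bases, 2, 3));  // prints 143
    }
}
```

Under this scheme, appending segments to the end of the list would shift
no existing document numbers, since only the new segments receive new
bases.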


To unsubscribe, e-mail:
For additional commands, e-mail:
