lucene-dev mailing list archives

From Chuck Williams <>
Subject Efficiently expunging deletions of recently added documents
Date Mon, 04 Dec 2006 21:15:46 GMT
Hi All,

I'd like to open up the API to mergeSegments() in IndexWriter and am
wondering if there are potential problems with this.

I use ParallelReader and ParallelWriter (in JIRA) extensively, as these
provide the basis for fast bulk updates of small metadata fields.
ParallelReader requires that the subindexes be strictly synchronized by
matching doc ids.  The thorniest problem arises when writing a new
document (with ParallelWriter) generates an exception in some of the
subindexes but not others, as this leaves the subindexes out of sync.

I have recovery for this now that works by deleting the successfully
added subdocuments that are parallel to any unsuccessful subdocument and
then optimizing to expunge the unsuccessful doc-id from those segments
where it had been added.  Optimization is prohibitively expensive for
large indexes, and unnecessary for this recovery.

A much better solution is to have an API in IndexWriter to expunge a
given set of deleted doc ids.  This could merge only enough recent
segments to fully encompass the specified docs, which in this case is
not much since they will be recently added.  The result should be an
orders-of-magnitude performance improvement in the recovery.
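
To make the idea concrete, here is a rough sketch of how the range of
segments to merge might be chosen.  It is not Lucene code: the class
ExpungePlanner and both of its methods are hypothetical names, and it
assumes that segments are described only by their doc counts in index
order and that doc ids are assigned contiguously across segments.  It
merges the tail of the index starting at the first segment that holds
any of the given deleted doc ids.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: picks the tail of segments to merge.
public class ExpungePlanner {

    /**
     * Returns the index of the first segment that must be merged so that the
     * range [result, segmentDocCounts.length) covers every deleted doc id.
     * Returns -1 when there is nothing to expunge.
     */
    public static int firstSegmentToMerge(int[] segmentDocCounts, int[] deletedDocIds) {
        if (deletedDocIds.length == 0) {
            return -1;
        }
        // Only the smallest deleted id matters: the merge runs to the end of the index.
        int minDeleted = Arrays.stream(deletedDocIds).min().getAsInt();
        if (minDeleted < 0) {
            throw new IllegalArgumentException("negative doc id: " + minDeleted);
        }
        int base = 0; // doc id of the first document in the current segment
        for (int i = 0; i < segmentDocCounts.length; i++) {
            int end = base + segmentDocCounts[i]; // exclusive upper bound
            if (minDeleted < end) {
                return i;
            }
            base = end;
        }
        throw new IllegalArgumentException("doc id " + minDeleted + " is beyond the last segment");
    }

    /** Number of documents that the merge starting at segment 'first' would rewrite. */
    public static int docsToRewrite(int[] segmentDocCounts, int first) {
        int total = 0;
        for (int i = first; i < segmentDocCounts.length; i++) {
            total += segmentDocCounts[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: one large old segment and a few small recent ones.
        int[] segments = {1_000_000, 100_000, 10_000, 10, 10};
        int[] deleted = {1_110_012, 1_110_015}; // recently added, then deleted
        int first = firstSegmentToMerge(segments, deleted);
        System.out.println("merge from segment " + first
                + ", rewriting " + docsToRewrite(segments, first) + " docs");
    }
}
```

With recently added documents the chosen range is a handful of small
segments at the end of the index, whereas optimizing rewrites every
segment.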

I'm planning to make this change and submit a patch for it unless I've
missed something that somebody can point out.  At the same time, I'll
update the ParallelWriter submission as there are a number of bug fixes
plus a substantial general (non-recovery-case) performance improvement
I've just identified and am about to implement.

Thanks for any thoughts, suggestions, or problems you can point out.

