lucene-dev mailing list archives

From Andi Vajda
Subject Re: DbDirectory and compound files
Date Thu, 30 Sep 2004 16:35:26 GMT

> The purpose of the compound file implementation is to minimize the number of 
> open files that an IndexReader must keep open.  Instead of 7 + the number of 
> indexed fields files per segment, only a single file must be kept open per 
> segment.  This helps applications which keep lots of unoptimized indexes 
> open.  (It also, and this is more common, helps folks who open a new 
> IndexReader for each query and don't close it.  In this case, opening fewer 
> files gives the garbage collector time to close files before the process runs 
> into its file descriptor limit, inducing a flurry of bug reports about "too 
> many open files".)
> Does that make any more sense?

Yes, thanks for the explanation. This confirms that the compound file 
implementation is not that useful in conjunction with the DbDirectory 
implementation, since the only open OS files are the ones opened by 
Berkeley DB, i.e., the two db files plus some log files if transactions 
are used. The number of open OS files is more or less constant, is controlled 
by the Berkeley DB environment, and is independent of the number of 
IndexWriter instances open.
The same reasoning would also apply to RAMDirectory. No files are open at all 
in that case, right?
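
To put rough numbers on this, here is a minimal sketch of the arithmetic. 
OpenFileEstimate and its methods are hypothetical helpers, not part of Lucene 
or DbDirectory; the "7 + indexed fields" and "two db files + log files" 
figures are taken from the discussion above, and the log file count is an 
assumed input.

```java
// Hypothetical helper, not part of Lucene: rough count of OS files held open.
class OpenFileEstimate {
    // Non-compound format: 7 files plus one per indexed field, for each segment.
    static int nonCompound(int segments, int indexedFields) {
        return segments * (7 + indexedFields);
    }

    // Compound format: a single file per segment.
    static int compound(int segments) {
        return segments;
    }

    // DbDirectory: two db files plus the Berkeley DB log files,
    // regardless of how many segments the index has.
    static int dbDirectory(int logFiles) {
        return 2 + logFiles;
    }

    public static void main(String[] args) {
        System.out.println(nonCompound(10, 5)); // 10 * (7 + 5) = 120
        System.out.println(compound(10));       // 10
        System.out.println(dbDirectory(3));     // 2 + 3 = 5
    }
}
```

With 10 segments and 5 indexed fields, the compound format cuts the count from 
120 to 10, while the DbDirectory count does not depend on the segments at all.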

> These changes are back-compatible: the old classes and methods are still 
> there and interoperate with the new but are deprecated.  You might wait until 
> there is a Lucene release with the new API in it before you update 
> DbDirectory.  To move to the new API, all that should be required is changing 
> your subclass of InputStream to instead subclass BufferedIndexInput, and also 
> change your subclass of IndexOutput to instead subclass BufferedIndexOutput. 
> You'll also need to add a length() method to your BufferedIndexInput 
> subclass, instead of setting a protected length field in the constructor. 
> That's it.

Cool, that should be easy enough.
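
As I understand it, the shape of the change would be roughly the following. 
This is only a sketch: SketchBufferedIndexInput is a stand-in declared here so 
the example compiles on its own, not the real Lucene class, and apart from 
length() its method names are my assumption about what the new base class 
expects. SketchDbIndexInput reads from a byte array rather than from 
Berkeley DB.

```java
// Stand-in for the new buffered base class (assumption, not the Lucene source).
abstract class SketchBufferedIndexInput {
    abstract void readInternal(byte[] b, int offset, int length) throws java.io.IOException;
    abstract void seekInternal(long pos) throws java.io.IOException;
    abstract long length(); // replaces the protected length field set in the constructor
    abstract void close() throws java.io.IOException;
}

// Hypothetical DbDirectory-style input, backed by a byte array for illustration.
class SketchDbIndexInput extends SketchBufferedIndexInput {
    private final byte[] data;
    private int position;

    SketchDbIndexInput(byte[] data) {
        this.data = data; // no length field to set here any more
    }

    // Copies len bytes from the current position into b, then advances.
    void readInternal(byte[] b, int offset, int len) {
        System.arraycopy(data, position, b, offset, len);
        position += len;
    }

    void seekInternal(long pos) {
        position = (int) pos;
    }

    // The length is now reported through a method.
    long length() {
        return data.length;
    }

    void close() {
        // nothing to release in this in-memory sketch
    }
}
```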

> The revision of the API was primarily to make buffering optional.  We could 
> have left the buffered implementation names the same, but then the classes 
> would be named poorly and it also seemed like an opportunity to remove the 
> name clash with

This point about buffering brings up another one. Currently, there is no 
public way to tell an open IndexWriter to flush its Directory. This makes it 
difficult to use several transactions during the lifetime of the IndexWriter.
For example, it would be good if the Berkeley DB transaction could be 
committed after each indexing operation. For that to work, though, the 
DbDirectory buffers have to be flushed first, and there is no public API 
available at the moment to tell the IndexWriter to make this happen.
Are you saying that this situation is improved with the new index IO 
classes, since buffering was made optional?
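
To illustrate the ordering problem, here is a minimal sketch. Both classes are 
hypothetical and use neither the Lucene nor the Berkeley DB API: a 
StringBuilder stands in for the database, and commit() simply snapshots 
whatever has reached it.

```java
// Hypothetical buffered output: writes stay in a local buffer until flush().
class SketchBufferedOutput {
    private final StringBuilder store; // stands in for the Berkeley DB record
    private final StringBuilder buffer = new StringBuilder();

    SketchBufferedOutput(StringBuilder store) {
        this.store = store;
    }

    void write(String s) {
        buffer.append(s);
    }

    // Pushes buffered data to the store and empties the buffer.
    void flush() {
        store.append(buffer);
        buffer.setLength(0);
    }
}

// Hypothetical transaction: commit() captures only what is already in the store.
class SketchTransaction {
    private final StringBuilder store;
    private String committed = "";

    SketchTransaction(StringBuilder store) {
        this.store = store;
    }

    void commit() {
        committed = store.toString();
    }

    String committed() {
        return committed;
    }
}
```

If commit() runs after write("doc1") but before flush(), the committed state 
is still empty; only a commit after flush() contains "doc1". That is why the 
IndexWriter would need some way to flush its Directory before each commit.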

