lucene-dev mailing list archives

From Ian Boston <>
Subject Re: GData Server - Lucene storage
Date Wed, 07 Jun 2006 08:51:36 GMT
I'm picking this thread up from the web archive, where there was some 
talk of replication of indexes, so this message may not be threaded 
correctly. I've just completed a custom FSDirectory implementation that 
is designed to work in a cluster with replication.

The anatomy of this cluster is a shared database (MySQL or Oracle) and 
stateless nodes with local disk storage. The index load is not that high 
(when you look at big Nutch installations), but not tiny either: maybe 
1TB of raw data, with an index of 10GB (a guess).

I would have used rsync, but ideally I wanted it to work with no 
sysadmin setup (a pure Java install). I looked at, and really liked, 
NDFS, but decided it was too much admin overhead to set up. The 
deployers like to do a Maven build and deploy, then start Tomcat, to 
get up and running (easy life!)

Indexing is performed using a queue (persisted in the DB), with a 
distributed lock manager allowing one node in the cluster to take 
responsibility for indexing and notify all other nodes when done 
(they then reload the index). This happens every few minutes in production.
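The lock-manager idea above can be sketched roughly as follows. This is a hypothetical simplification, with a ConcurrentHashMap standing in for the shared lock table; the real implementation would use an atomic INSERT or UPDATE against the MySQL/Oracle database so that exactly one node wins:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the distributed lock manager. The map simulates
// a row-per-lock table in the shared database; putIfAbsent plays the role
// of an atomic conditional INSERT.
public class IndexLock {

    // lockName -> id of the node currently holding the lock.
    private final Map<String, String> locks = new ConcurrentHashMap<>();

    // Try to take the indexing lock; only one node in the cluster succeeds.
    public boolean tryAcquire(String lockName, String nodeId) {
        return locks.putIfAbsent(lockName, nodeId) == null;
    }

    // Release after indexing completes (and the reload event has been
    // broadcast), so another node can take over on the next cycle.
    public void release(String lockName, String nodeId) {
        locks.remove(lockName, nodeId);
    }
}
```

The essential property is the atomic test-and-set: whichever node's insert succeeds does the indexing pass, and everyone else simply waits for the reload notification.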

FSDirectory is efficient and fast, and I wanted that in the cluster. I 
looked at JDBCDirectory (from the Compass framework) but found that even 
with a non-compound index, the DB overhead was just too great (on 
average 1/10 the performance on MySQL compared to local disk; Oracle 
might be better), the problem mainly being seeks into BLOBs. I guess the 
Berkeley DB Directory is going to be similar in some ways, except the 
seeks may be more efficient.

Eventually I borrowed some concepts from Nutch. The index writer writes 
a new segment with FSDirectory, then merges it into the current segment; 
that segment is compressed, checksummed (MD5), and sent to the 
database. Current segments are rotated when they grow beyond 2MB. When a 
node receives an index reload event, it syncs its local segments with 
the DB, and loads them with a MultiReader using FSDirectory.
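The compress-and-checksum step before shipping a segment to the database can be sketched with plain JDK classes. This is a minimal illustration, not the poster's actual code; the class and method names are invented:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the "compress, checksum, ship" step.
public class SegmentShipper {

    // Gzip-compress a segment's bytes before storing them as a DB blob.
    public static byte[] compress(byte[] segment) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(segment);
        }
        return bos.toByteArray();
    }

    // Inverse operation, used when a node pulls a segment back down
    // during an index reload.
    public static byte[] decompress(byte[] blob) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    // MD5 checksum as hex, stored alongside the blob so nodes can
    // validate their local copies against the DB.
    public static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Storing the checksum with the blob is what makes the later validation and real-time backup properties cheap: a node can decide whether its local segment copy is current without downloading the blob itself.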

The sweet-spot features are:

Performance is almost the same as FSDirectory, except that the end of 
the IndexWriter operation and the start of the IndexReader operation 
have slightly more overhead.

When nodes are added to the cluster, they can validate their local 
segment copies and bring them up to date against the cluster.

There is a real-time backup of the index.

The segments are validated prior to being sent to the DB.

You could easily use a SAN/NAS in place of the DB to ship the segments.
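The validate-and-catch-up step for a newly added node can be sketched as a comparison of checksum manifests (segment name to MD5). This is an assumed shape, not the actual code; the names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of segment validation: compare the cluster's
// manifest (stored in the DB) against the node's local checksums and
// decide which segment blobs must be fetched.
public class SegmentSync {

    public static List<String> segmentsToFetch(Map<String, String> dbManifest,
                                               Map<String, String> localManifest) {
        List<String> toFetch = new ArrayList<>();
        for (Map.Entry<String, String> e : dbManifest.entrySet()) {
            String localMd5 = localManifest.get(e.getKey());
            // Fetch anything missing locally or whose checksum disagrees.
            if (localMd5 == null || !localMd5.equals(e.getValue())) {
                toFetch.add(e.getKey());
            }
        }
        return toFetch;
    }
}
```

A fresh node starts with an empty local manifest, so it fetches everything; an existing node fetches only the segments that changed since its last reload.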


I haven't done real heavy production tests, but I have had it running, 
indexing the contents of my hard disk flat out for over 48 hours, with 
200 2MB segments in the DB.

There is probably some housekeeping (e.g. merging) that should be done, 
and, not being a Lucene expert, I am bound to have missed something.

If anyone spots anything, please let me know :)


If you're interested, you can find the code at

The Distributed Lock manager is at

The Indexer is at

and the JDBC Index shipper is at

