lucene-general mailing list archives

From Stefan Groschupf
Subject Re: Lucene-based Distributed Index Leveraging Hadoop
Date Fri, 22 Aug 2008 10:42:34 GMT

> In terms of which project best fits my needs my gut feeling is that
> dlucene is pretty close. It supports incremental updates, and doesn't
> build in dependencies on systems like HDFS or Terracotta (I don't yet
> understand all the implications of those systems so would rather keep
> things simple if possible).

The way we solve this with katta is that we simply deploy a new small
index and use * in the client instead of a fixed index name.
Then once a night we merge all the small indexes together into one big
new index, since having many small indexes slows things down.
To solve the problem of duplicate documents, each document gets a
timestamp, and in the client we do a simple dedup based on a key,
always keeping the document with the latest timestamp.
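
To make the client-side dedup concrete, here is a minimal sketch in
plain Java. It is not Katta's actual code: the Hit class, its fields
(key, timestamp, shard) and the dedup method are hypothetical names
chosen for illustration, and it assumes the client has already
collected the hits from all shards into one list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestHitDedup {

    /** One search hit returned by a shard. All field names are hypothetical. */
    public static final class Hit {
        public final String key;      // application-level document key
        public final long timestamp;  // timestamp stored with the document
        public final String shard;    // name of the shard that returned the hit

        public Hit(String key, long timestamp, String shard) {
            this.key = key;
            this.timestamp = timestamp;
            this.shard = shard;
        }
    }

    /**
     * Keeps one hit per key: the one with the largest timestamp.
     * Keys appear in the result in the order they were first seen.
     */
    public static List<Hit> dedup(List<Hit> hits) {
        Map<String, Hit> latest = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Hit seen = latest.get(hit.key);
            // Replace the stored hit only if this one is strictly newer.
            if (seen == null || hit.timestamp > seen.timestamp) {
                latest.put(hit.key, hit);
            }
        }
        return new ArrayList<>(latest.values());
    }
}
```

With this approach an updated document in a small index shadows the
older copy in the big index until the nightly merge. The merge itself
would need to apply the same rule so the old copy is dropped.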

Katta is independent of those technologies; it is built on lucene,
zookeeper and hadoop RPC (instead of RMI, http or Apache Mina). We do
support loading index shards from a hadoop file system, but you can
also load them from a mounted remote disk, a NAS or whatever you like.

> The obvious drawback being that dlucene
> doesn't seem to be an active public project.
Mark needs to answer this, but dlucene is checked in to the katta svn
and I saw Marko checking in changes to dlucene. There was a discussion
between Mark and me about bringing dlucene and katta together, and I
would really love to see that happen, but unfortunately we had a lot of
pressure from our customer to deliver something, so we had to focus on
other things. More developers getting involved would clearly help
here.. :-)

> Thanks for the reply Stefan. I'll certainly be taking a look through
> the code for Katta since no doubt there's a lot to learn in there.
Katta will be deployed into a production system of our customer in
less than 4 weeks, so we are working hard to iron out issues.
However, katta has been running for 6 weeks in a 10 node test
environment under heavy load.
