lucene-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki>
Subject Re: Lucene-based Distributed Index Leveraging Hadoop
Date Thu, 07 Feb 2008 19:53:45 GMT
Doug Cutting wrote:
> Ning,
> I am also interested in starting a new project in this area.  The 
> approach I have in mind is slightly different, but hopefully we can come 
> to some agreement and collaborate.

I'm interested in this too.

> My current thinking is that the Solr search API is the appropriate 
> model.  Solr's facets are an important feature that require low-level 
> support to be practical.  Thus a useful distributed search system should 
> support facets from the outset, rather than attempt to graft them on 
> later.  In particular, I believe this requirement mandates disjoint shards.

I agree - shards should be disjoint also because if we eventually want 
to manage multiple replicas of each shard across the cluster (for 
reliability and performance) then overlapping documents would complicate 
both the query dispatching process and the merging of partial result sets.

> My primary difference with your proposal is that I would like to support 
> online indexing.  Documents could be inserted and removed directly, and 
> shards would synchronize changes amongst replicas, with an "eventual 
> consistency" model.  Indexes would not be stored in HDFS, but directly 
> on the local disk of each node.  Hadoop would perhaps not play a role. 
> In many ways this would resemble CouchDB, but with explicit support for 
> sharding and failover from the outset.

It's true that searching over HDFS is slow - but I'd hate to lose all 
other HDFS benefits and have to start from scratch ... I wonder what 
would be the performance of FsDirectory over an HDFS index that is 
"pinned" to a local disk, i.e. a full local replica is available, with 
block size of each index file equal to the file size.

> A particular client should be able to provide a consistent read/write 
> view by bonding to particular replicas of a shard.  Thus a user who 
> makes a modification should be able to generally see that modification 
> in results immediately, while other users, talking to different 
> replicas, may not see it until synchronization is complete.

This requires that we use versioning, and that we have a "shard manager" 
that knows the latest versions of each shard among the whole active set 
- or that clients discover this dynamically by querying the shard 
servers every now and then.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com

View raw message