lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Stewart <>
Subject SOLR architecture recommendation
Date Tue, 27 Sep 2011 15:40:41 GMT
I need some recommendations for a new SOLR project.

We currently have a large (200M docs) production system using Lucene.Net and what I would
call our own .NET implementation of SOLR (built early on when SOLR was less mature and did
not run as well on Windows).  

Our current architecture works like this:

We have 2 master updater servers, which build Lucene indexes, and they both are live and reading
new documents to index from a shared queue (RabbitMQ).  Unlike SOLR we don't POST new documents
to the servers via HTTP, instead we let the servers read from the shared queue and index documents
as fast as they can.  This gives us scale and failover support, and basic natural load balancing.
 Each updater makes a new index snapshot on some configurable interval (typically 1 minute).
 During index snapshot, the indexing thread is blocked so that new documents are not read
from queue during that time.  A new snapshot uses NTFS hard links to replicate the master
index into a new directory on the master server.

We have 4 searcher servers, which each pull snapshots from both updaters and then search both
indexes in parallel and merge results.  Each searcher server is identical (they search same
data), and sits behind a load balancer.  Searchers have a background thread which continuously
looks in remote directories on master servers looking for new snapshot directories.  When
it sees a new snapshot exists, it uses NTFS hard links to replicate the current local snapshot
and pull down only files that changed (very similar to SOLR collection replication).  Then
switches searches over to new snapshot(s).

From the search client perspective, they simply issue HTTP GET request to the load balancer,
and have no idea how many masters/shards that exist.  Results are merged/resorted/de-duped
from all indexes transparently on the searchers.

My big question is, how can I very closely replicate this architecture with the latest version
of SOLR?  We don't need to replace this system but to implement a similar system for another
client using SOLR.  We really like our existing system architecture because it provides very
natural load balancing and sharding on the masters and provides nice failover support for
both masters and searchers.  We'd like to avoid search clients knowing about shards, and to
avoid explicitly posting HTTP requests to SOLR servers when adding documents, because we like
the more natural failover and load balancing of using a shared queue.

More specific questions:

How can SOLR in master/updater mode be configured to read new documents from a queue?  Is
this possible with a custom Data Import Handler, or do I need to develop some seperate service
which reads from a queue and then POSTS via HTTP to SOLR?

What is the best way to configure searchers to be able to merge results from more than one
index?  Is it possible to configure which shards to search by default, rather than forcing
the client to know about shards and specify shards in search request?  Can I do this with
a custom request handler, and if so specifically how?

Thanks a lot!

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message