lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Solr cloud recovery, why does restarting leader need replicas?
Date Wed, 28 Nov 2012 16:58:38 GMT
This is a protective measure. When it looks like a shard is first coming up, we wait to see
all of the expected replicas, or for a timeout, to ensure that everyone participates in the
initial sync process. If all of the nodes went down, we don't know which documents made it
where, and we don't want to lose any updates.
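
Roughly, the check behind the "Waiting until we see more replicas up" log line behaves like the
sketch below. This is illustrative only, not the actual ShardLeaderElectionContext code: the
class and method names here are made up, and the real implementation reacts to live-node state
in ZooKeeper rather than polling a Set.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Minimal sketch of "wait for the expected replicas, or give up after a timeout". */
public class LeaderWaitSketch {

    // expectedReplicas is the shard's replica count before the outage (total=2 in the log),
    // liveReplicas is whatever has registered as live so far (found=1),
    // waitMillis is the overall timeout (counted down as timeoutin).
    static void waitForReplicas(int expectedReplicas, Set<String> liveReplicas, long waitMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
        while (liveReplicas.size() < expectedReplicas) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                // Give up waiting and let the election proceed with whoever is up.
                System.out.println("Timed out waiting for replicas; continuing anyway");
                return;
            }
            System.out.println("Waiting until we see more replicas up: total=" + expectedReplicas
                    + " found=" + liveReplicas.size() + " timeoutin=" + remainingMs);
            Thread.sleep(500); // poll; the real code watches cluster state changes instead
        }
        // All expected replicas are up, so everyone can take part in the initial peer sync
        // and no updates that reached only a subset of the nodes get silently dropped.
    }

    public static void main(String[] args) throws InterruptedException {
        // Pretend only one of the two expected replicas has come back so far.
        Set<String> live = new HashSet<>(Arrays.asList("A1"));
        waitForReplicas(2, live, 5000); // short timeout just for the demo
    }
}

The length of that wait is configurable. I believe in 4.x it is the leaderVoteWait value (in
milliseconds) in solr.xml, so you can shorten it if you are comfortable with the trade-off;
check the docs for your release for the exact placement.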

- Mark

On Nov 28, 2012, at 10:47 AM, Daniel Collins <danwcollins@gmail.com> wrote:

> I was testing the basic SolrCloud test scenario from the wiki page, and
> found something (I considered) unexpected.
> 
> If the leader of the shard goes down, when it comes back up it requires N
> replicas to be running (where N is determined from what was running before,
> I think).
> 
> Simple setup, 4 servers, 2 shards (A, B), each with 2 replicas, e.g. A1,
> A2, B1, B2.
> 
> All 4 nodes start-up, A1, B1 are leaders, all is well.
> 
> A2 brought down, cloud is still fine. A2 brought back up and recovers; once
> recovery is complete, it is live.
> 
> A2 goes down, then A1.  Cloud is now unresponsive as Shard A has no nodes
> (as expected).
> 
> A1 comes back up.  However, the shard is still not responsive due to errors:
> 
> 2012-11-28 10:45:27,328 INFO [main] o.a.s.c.ShardLeaderElectionContext
> [ElectionContext.java:287] Waiting until we see more replicas up: total=2
> found=1 timeoutin=140262
> 
> I can understand that in the cloud setup A1 (if it wasn't the leader) would
> have to recover, but as A1 was the leader when it went down, shouldn't it be
> able to service requests on its own (it was when it went down!)?

