lucene-solr-user mailing list archives

From Alan Woodward <>
Subject Re: leader split-brain at least once a day - need help
Date Wed, 07 Jan 2015 14:46:17 GMT
I had a similar issue, which was caused by
 Are you getting long GC pauses or similar before the leader mismatches occur?
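One quick way to check for such pauses (a sketch, not from the original message: it assumes Tomcat sources bin/setenv.sh for CATALINA_OPTS, and the log path is a placeholder):

```shell
# Hedged sketch: enable GC logging for the Solr JVM (HotSpot 1.7 flags).
# Assumes Tomcat picks up CATALINA_OPTS from bin/setenv.sh; adjust the
# log path for your install.
CATALINA_OPTS="$CATALINA_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/solr/gc.log"
export CATALINA_OPTS
```

Long entries under "Total time for which application threads were stopped" in that log would line up with the 8am mismatches.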

Alan Woodward

On 7 Jan 2015, at 10:01, Thomas Lamy wrote:

> Hi there,
> we are running a 3-server cloud serving a dozen single-shard/replicate-everywhere collections.
> The 2 biggest collections are ~15M docs, about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK
> 3.4.5, Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.
> 10 of the 12 collections (the small ones) get filled by a DIH full-import once a day, starting
> at 1am. The second biggest collection is updated using DIH delta-import every 10 minutes;
> the biggest one gets bulk JSON updates with commits every 5 minutes.
> On a regular basis, we have a leader information mismatch:
> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is coming
> from leader, but we are the leader
> or the opposite:
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are
> the leader, but locally we don't think so
> One of these pops up once a day at around 8am, sending either some cores into "recovery
> failed" state, or all cores of at least one cloud node into state "gone".
> This started out of the blue about 2 weeks ago, without any changes to software,
> data, or client behaviour.
> Most of the time, we get things going again by restarting solr on the current leader
> node, forcing a new election - can this be triggered while keeping solr (and the caches) up?
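(An aside, not from the original mail: without restarting anything, you can at least see which replica ZooKeeper currently considers the leader via the Collections API, available since Solr 4.8. Host, port, and collection name below are placeholders.)

```shell
# Hedged example: dump cluster state as ZooKeeper sees it (Solr >= 4.8).
# "solr1:8080" and "mycollection" are placeholders for your own setup.
curl "http://solr1:8080/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"
```

Comparing the "leader":"true" replica in that output against what each node logs locally shows which side of the mismatch is stale.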
> But sometimes this doesn't help; we had an incident last weekend where our admins didn't
> restart in time, creating millions of entries in /solr/overseer/queue, which made zk close the
> connection and leader re-election fail. I had to flush zk and re-upload the collection config
> to get solr up again (just like in
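(Again not part of the original mail: a hedged sketch of how the overseer queue can be inspected with ZooKeeper's own CLI, and how a config re-upload can be done with Solr's zkcli.sh. All hostnames, paths, and the /solr chroot are placeholders for your install.)

```shell
# Hedged sketch: check the overseer queue depth (zkCli.sh ships with ZK 3.4.x).
# "zk1:2181" and the /solr chroot are placeholders.
echo "ls /solr/overseer/queue" | /opt/zookeeper/bin/zkCli.sh -server zk1:2181

# Re-upload a collection's config with Solr's zkcli.sh (its location varies
# by Solr 4.x install; often under example/scripts/cloud-scripts/).
/opt/solr/example/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/solr \
  -cmd upconfig -confdir /path/to/collection/conf -confname mycollection
```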
> We have a much bigger cloud (7 servers, ~50GiB data in 8 collections, 1500 requests/s)
> up and running, which has not had these problems since upgrading to 4.10.2.
> Any hints on where to look for a solution?
> Kind regards
> Thomas
> -- 
> Thomas Lamy
> Cytainment AG & Co KG
> Nordkanalstrasse 52
> 20097 Hamburg
> Tel.:     +49 (40) 23 706-747
> Fax:     +49 (40) 23 706-139
> Sitz und Registergericht Hamburg
> HRA 98121
> HRB 86068
> Ust-ID: DE213009476
