From Thomas Lamy <t.l...@cytainment.de>
Subject leader split-brain at least once a day - need help
Date Wed, 07 Jan 2015 10:01:29 GMT
Hi there,

We are running a 3-server cloud serving a dozen single-shard,
replicate-everywhere collections. The two biggest collections are ~15M docs
and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5, Tomcat 7.0.56,
Oracle Java 1.7.0_72-b14.

10 of the 12 collections (the small ones) are filled by a DIH full-import
once a day, starting at 1am. The second biggest collection is updated using
DIH delta-import every 10 minutes; the biggest one gets bulk JSON updates
with commits every 5 minutes.
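
(For context, the add-and-commit cycle on the biggest collection is nothing
fancy - we post JSON over HTTP, but the equivalent SolrJ sketch below shows
the pattern; host, collection and field names are placeholders, not our real
ones.)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkUpdateSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder ZK ensemble and chroot.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
            solr.setDefaultCollection("big_collection"); // placeholder name

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_t", "example title " + i); // placeholder field
                batch.add(doc);
            }
            solr.add(batch);  // routed to the shard leader, then to replicas
            solr.commit();    // explicit commit, as in our 5-minute cycle
            solr.shutdown();
        }
    }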

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request 
says it is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; 
ClusterState says we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, either sending some cores
into "recovery failed" state or putting all cores of at least one cloud
node into state "gone".
This started out of the blue about two weeks ago, without any changes to
software, data, or client behaviour.
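
To see which side is wrong when the mismatch appears, one can compare the
leader that ZooKeeper has registered with what the node itself logs. A rough
sketch with the plain ZooKeeper client (connect string and collection name
are placeholders, and the znode layout may differ between Solr versions):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class LeaderCheckSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder ensemble; "/solr" is the chroot we use.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/solr", 30000,
                    new Watcher() { public void process(WatchedEvent e) {} });
            try {
                Stat stat = new Stat();
                // In 4.x the elected leader of a shard is registered here
                // (collection name is a placeholder).
                byte[] leader = zk.getData("/collections/big_collection/leaders/shard1",
                        false, stat);
                System.out.println("ZK leader registration: " + new String(leader, "UTF-8"));

                // clusterstate.json is what the rest of the cluster believes.
                byte[] state = zk.getData("/clusterstate.json", false, stat);
                System.out.println("clusterstate.json: " + new String(state, "UTF-8"));
            } finally {
                zk.close();
            }
        }
    }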

Most of the time we get things going again by restarting Solr on the
current leader node, forcing a new election - can this be triggered
while keeping Solr (and the caches) up?
But sometimes this doesn't help: we had an incident last weekend where
our admins didn't restart in time, millions of entries piled up in
/solr/overseer/queue, ZooKeeper closed the connection, and leader
re-election failed. I had to flush ZooKeeper and re-upload the collection
configs to get Solr up again (just like in
https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
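
For anyone wanting to check their own cluster: the overseer backlog is
visible as child znodes of the queue. A rough sketch (connect string is a
placeholder; because of the /solr chroot the path below corresponds to
/solr/overseer/queue):

    import java.util.List;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class OverseerQueueSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/solr", 30000,
                    new Watcher() { public void process(WatchedEvent e) {} });
            try {
                // Every pending cluster-state operation is a child znode of the queue.
                List<String> pending = zk.getChildren("/overseer/queue", false);
                System.out.println("overseer queue length: " + pending.size());
            } finally {
                zk.close();
            }
        }
    }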

We have a much bigger cloud (7 servers, ~50GiB of data in 8 collections,
1500 requests/s) up and running, which has not had these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

-- 
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476

