lucene-solr-user mailing list archives

From Yonik Seeley <yo...@heliosearch.com>
Subject Re: leader split-brain at least once a day - need help
Date Thu, 08 Jan 2015 15:33:22 GMT
It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called "split brain").
The async nature of updates (and of thread scheduling), along with
stop-the-world GC pauses that can change leadership, causes these
little windows of inconsistency that we detect and log.
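If GC pauses are the suspect, turning on GC logging lets you correlate pause
times with the timestamps of those mismatch messages. A sketch for the Oracle
Java 7 JVM mentioned below, to be added to Tomcat's startup options (the log
path and setenv.sh location are placeholders for your layout):

```shell
# In Tomcat's bin/setenv.sh: have the JVM log every GC, with wall-clock
# timestamps and total stopped time, so pauses can be matched against
# Solr's log entries. /var/log/solr/gc.log is a placeholder path.
CATALINA_OPTS="$CATALINA_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/solr/gc.log"
```

Long stopped-time entries lining up with the 8am mismatches would point at GC
(or at whatever load spike starts around then) rather than at Solr itself.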

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy <t.lamy@cytainment.de> wrote:
> Hi there,
>
> we are running a 3 server cloud serving a dozen
> single-shard/replicate-everywhere collections. The 2 biggest collections are
> ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
> 7.0.56, Oracle Java 1.7.0_72-b14
>
> 10 of the 12 collections (the small ones) get filled by a DIH full-import once
> a day starting at 1am. The second biggest collection is updated using DIH
> delta-import every 10 minutes, and the biggest one gets bulk JSON updates with
> commits every 5 minutes.
>
> On a regular basis, we have a leader information mismatch:
> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
> is coming from leader, but we are the leader
> or the opposite
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
> says we are the leader, but locally we don't think so
>
> One of these pops up once a day at around 8am, sending either some cores into
> "recovery failed" state, or all cores of at least one cloud node into state
> "gone".
> This started out of the blue about 2 weeks ago, without any changes to
> software, data, or client behaviour.
>
> Most of the time, we get things going again by restarting solr on the
> current leader node, forcing a new election - can this be triggered while
> keeping solr (and the caches) up?
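As far as I know, 4.10 has no supported API call to force a new election while
keeping the JVM (and its caches) up; a FORCELEADER command only arrived in later
releases. What you can do before bouncing anything is compare ZooKeeper's view
of the leader with the aggregate cluster state, to pick the right node to
restart. A sketch, with host names, the /solr chroot, and the collection name
as placeholders:

```shell
# What ZK currently records as leader for shard1 of a collection,
# via ZooKeeper's own CLI (zk1, /solr chroot, and mycoll are placeholders):
zkCli.sh -server zk1:2181 get /solr/collections/mycoll/leaders/shard1

# The cluster-wide view via the Collections API
# (CLUSTERSTATUS has been available since Solr 4.8):
curl 'http://solr1:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json'
```

If the leader znode and CLUSTERSTATUS disagree about the same shard, restarting
the node named in the stale entry is usually the one that forces the re-election.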
> But sometimes this doesn't help: we had an incident last weekend where our
> admins didn't restart in time, millions of entries piled up in
> /solr/overseer/queue, ZooKeeper closed the connection, and leader re-election
> failed. I had to flush ZK and re-upload the collection config to get Solr up
> again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
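For that stuck-overseer case, clearing only the overseer queue with Solr's
zkcli tool should be less drastic than wiping ZK wholesale. A sketch assuming
the 4.x layout, where zkcli.sh ships under example/scripts/cloud-scripts (the
zkhost string, chroot, and config paths are placeholders):

```shell
# ZK ensemble with the /solr chroot; adjust to your setup.
ZKHOST=zk1:2181,zk2:2181,zk3:2181/solr

# Drop the runaway overseer queue instead of flushing all of ZK:
./zkcli.sh -zkhost "$ZKHOST" -cmd clear /overseer/queue

# Re-upload one collection's config set if it was lost:
./zkcli.sh -zkhost "$ZKHOST" -cmd upconfig \
  -confdir /path/to/mycoll/conf -confname mycoll
```

That keeps the live collection data and the rest of the ZK tree intact, so a
rolling restart afterwards should be enough to get elections going again.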
>
> We have a much bigger cloud (7 servers, ~50GiB of data in 8 collections, 1500
> requests/s) up and running, which has not had these problems since upgrading
> to 4.10.2.
>
>
> Any hints on where to look for a solution?
>
> Kind regards
> Thomas
>
> --
> Thomas Lamy
> Cytainment AG & Co KG
> Nordkanalstrasse 52
> 20097 Hamburg
>
> Tel.:     +49 (40) 23 706-747
> Fax:     +49 (40) 23 706-139
> Sitz und Registergericht Hamburg
> HRA 98121
> HRB 86068
> Ust-ID: DE213009476
>
