lucene-solr-user mailing list archives

From Thomas Lamy <t.l...@cytainment.de>
Subject Re: leader split-brain at least once a day - need help
Date Mon, 12 Jan 2015 12:34:11 GMT
Hi,

I found no big/unusual GC pauses in the log (at least by manual inspection; I
found no free tool to analyze them that worked out of the box on a headless
Debian Wheezy box). Eventually I tried -Xmx8G (it was 64G before) on one of the
nodes, after checking that heap usage was at about 2-3GB after an hour of run
time. That didn't change the time frame in which a restart was needed, so I
don't think Solr's JVM GC is the problem.
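
For reference, the heap change on the test node looked roughly like this; a
minimal sketch assuming Solr runs inside Tomcat and picks up its JVM options
from $CATALINA_BASE/bin/setenv.sh (file name, paths and values are illustrative
for our setup):

    # $CATALINA_BASE/bin/setenv.sh (illustrative)
    # fixed 8G heap for the Solr JVM instead of the previous -Xmx64G
    CATALINA_OPTS="$CATALINA_OPTS -Xms8g -Xmx8g"
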
We're now trying to get all of our nodes' logs (ZooKeeper and Solr) into
Splunk, just to get a better-sorted view of what's going on in the cloud once
a problem occurs. We're also enabling GC logging for ZooKeeper; maybe we were
missing problems there while focusing on the Solr logs.
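
The GC logging we're enabling looks roughly like this; a sketch for Oracle
Java 7, assuming ZooKeeper 3.4 picks the flags up from JVMFLAGS in
conf/java.env (log path illustrative):

    # conf/java.env on the ZooKeeper nodes (illustrative)
    JVMFLAGS="$JVMFLAGS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/zookeeper/gc.log"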

Thomas


On 08.01.15 at 16:33, Yonik Seeley wrote:
> It's worth noting that those messages alone don't necessarily signify
> a problem with the system (and it wouldn't be called "split brain").
> The async nature of updates (and thread scheduling), along with
> stop-the-world GC pauses that can change leadership, causes these
> little windows of inconsistency that we detect and log.
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data
>
>
> On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy <t.lamy@cytainment.de> wrote:
>> Hi there,
>>
>> We are running a 3-server cloud serving a dozen
>> single-shard/replicate-everywhere collections. The two biggest collections have
>> ~15M docs and are about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZooKeeper 3.4.5,
>> Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.
>>
>> 10 of the 12 collections (the small ones) get filled by a DIH full-import once
>> a day, starting at 1am. The second-biggest collection is updated using a DIH
>> delta-import every 10 minutes, and the biggest one gets bulk JSON updates with
>> commits every 5 minutes.
>>
>> On a regular basis, we have a leader information mismatch:
>> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
>> is coming from leader, but we are the leader
>> or the opposite:
>> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
>> says we are the leader, but locally we don't think so
>>
>> One of these pops up once a day at around 8am, either putting some cores into
>> "recovery failed" state, or putting all cores of at least one cloud node into
>> state "gone".
>> This started out of the blue about 2 weeks ago, without any changes to
>> software, data, or client behaviour.
>>
>> Most of the time we get things going again by restarting Solr on the current
>> leader node, forcing a new election - can an election be triggered while
>> keeping Solr (and its caches) up?
>> But sometimes this doesn't help: we had an incident last weekend where our
>> admins didn't restart in time, which created millions of entries in
>> /solr/overseer/queue, made ZooKeeper close the connection, and caused the
>> leader re-election to fail. I had to flush ZooKeeper and re-upload the
>> collection configs to get Solr up again (just like in
>> https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
>>
>> We have a much bigger cloud (7 servers, ~50GiB of data in 8 collections, 1500
>> requests/s) up and running, which has not had these problems since upgrading
>> to 4.10.2.
>>
>>
>> Any hints on where to look for a solution?
>>
>> Kind regards
>> Thomas
>>
>> --
>> Thomas Lamy
>> Cytainment AG & Co KG
>> Nordkanalstrasse 52
>> 20097 Hamburg
>>
>> Tel.:     +49 (40) 23 706-747
>> Fax:     +49 (40) 23 706-139
>> Sitz und Registergericht Hamburg
>> HRA 98121
>> HRB 86068
>> Ust-ID: DE213009476
>>
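
For anyone hitting the same overseer queue flood: inspecting (and, as a last
resort, clearing) the queue can be done with zkCli.sh, roughly as sketched
below - assuming the /solr chroot from the path above, with host and port
illustrative:

    # on one of the ZooKeeper nodes (illustrative host/port)
    ./zkCli.sh -server zk1:2181
    # list the queued overseer operations
    ls /solr/overseer/queue
    # last resort, with Solr stopped: drop the queue recursively
    rmr /solr/overseer/queue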


-- 
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.:     +49 (40) 23 706-747
Fax:     +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476

