lucene-solr-user mailing list archives

From Jakov Sosic <jso...@gmail.com>
Subject solr cloud going down repeatedly
Date Mon, 18 Aug 2014 17:30:55 GMT
Hi guys.

I have a Solr cloud consisting of 3 ZooKeeper VMs running 3.4.5 (the 
Ubuntu 14.04 LTS package backported to 12.04 LTS).

They coordinate 4 Solr nodes, which have 2 cores. Each core is sharded, 
so one shard sits on each of the Solr nodes.

Solr runs under Tomcat 7 on Ubuntu's latest OpenJDK 7.

The Solr version is 4.2.1.

Each node has around 7 GB of index data, and the JVM is set to run an 
8 GB heap. All Solr nodes have 16 GB of RAM.


A few weeks back we started having issues with this installation. Tomcat 
was filling up catalina.out with the following message:

SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:


The only solution was to restart all 4 Tomcats on the 4 Solr nodes. After 
that the issue would clear, but it would recur roughly a week after a 
restart.

This last happened yesterday, and I succeeded in recording some of what 
was happening on the boxes via Zabbix and atop.


Basically, at 15:35 the load on the machine went berserk, jumping from 
around 0.5 to 30+.

Zabbix and atop didn't record any heavy IO, and all the other processes 
were practically idle; only the JVM (Tomcat) exploded, with CPU usage 
climbing from the usual ~80% to around ~750%.

These are parts of the atop recordings on one of the nodes. Note that 
they are 10 minutes apart:

(15:28:42)
CPL | avg1    0.12  |               | avg5    0.36  | avg15   0.38  |

(15:38:42)
CPL | avg1    8.54  |               | avg5    3.62  | avg15   1.61  |

(15:48:42)
CPL | avg1   30.14  |               | avg5   27.09  | avg15  14.73  |



This is the status of the Tomcat process at the last sample point (15:48:42):

28891  tomcat7  tomcat7  411  8.68s  70m14s  209.9M  204K  0K  5804K  --  -  S  5  704%  java


I noticed similar things happening across the Solr nodes. At 17:41 the 
on-call person decided to hard-reset all the Solr nodes, and the cloud 
came back up running normally after that.

These are the logs I found on the first node:

Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

Aug 17, 2014 3:46:12 PM 
org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK
Aug 17, 2014 3:46:12 PM 
org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /overseer_elect/leader

Then a bunch of:

Aug 17, 2014 3:46:42 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

until the server was rebooted.


On the other nodes I can see:
node2:

Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for 
zkNodeName=10.100.254.103:8080_solr_myappcore=myapp
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for 
zkNodeName=10.100.254.103:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:46:24 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: IOException occured 
when talking to server at: http://node1:8080/solr/myapp

node4:

Aug 17, 2014 3:44:06 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for 
zkNodeName=10.100.254.105:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:44:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for 
zkNodeName=10.100.254.105:8080_solr_myappcore=myapp
Aug 17, 2014 3:45:37 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in 
zk:org.apache.solr.common.SolrException: Could not get leader props
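All the nodes lost their ZooKeeper session around the same time, which would fit a long stop-the-world pause exceeding the session timeout. As a stopgap I'm considering raising zkClientTimeout in solr.xml; a sketch, assuming the legacy 4.x solr.xml layout (attribute names from memory, so please correct me if 4.2.1 differs):

```xml
<!-- solr.xml: raise the ZK session timeout (default 15000 ms) so a long
     GC pause is less likely to expire the session outright -->
<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8080"
         zkClientTimeout="${zkClientTimeout:30000}">
    ...
  </cores>
</solr>
```

That only masks the pauses rather than fixing them, of course.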




My impression is that the garbage collector is at fault here.

This is the Tomcat command line:

/usr/lib/jvm/java-7-openjdk-amd64/bin/java 
-Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties 
-Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC -DnumShards=2 
-Djetty.port=8080 
-DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181 
-javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=9010 
-Dcom.sun.management.jmxremote.local.only=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager 
-Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath 
/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar 
-Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7 
-Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp 
org.apache.catalina.startup.Bootstrap start


So I am already using the concurrent mark-sweep (CMS) collector.
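To confirm the GC theory, I'm thinking of turning on GC logging before the next incident; something like this added to JAVA_OPTS (the /etc/default/tomcat7 location and log path are my assumptions for the Ubuntu package):

```shell
# Proposed additions to JAVA_OPTS (e.g. in /etc/default/tomcat7) so the next
# incident leaves a record of CMS activity and stop-the-world pause lengths
JAVA_OPTS="$JAVA_OPTS -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/tomcat7/gc.log"
```

The PrintGCApplicationStoppedTime flag in particular should show directly whether pauses are long enough to blow the ZK session timeout.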

Do you have any suggestions on how I can debug this further and 
potentially eliminate the issue causing these downtimes?
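If I do get a GC log with -XX:+PrintGCApplicationStoppedTime output, I plan to grep it with something like this (a sketch; long_pauses is just a name I made up, and it assumes the standard HotSpot line format):

```shell
# Sketch: list stop-the-world events longer than one second from a HotSpot
# GC log. With -XX:+PrintGCApplicationStoppedTime, lines end in
#   ... Total time for which application threads were stopped: 14.2345678 seconds
# so the pause length is the second-to-last awk field.
long_pauses() {
  awk '/application threads were stopped/ && $(NF-1)+0 > 1.0' "$1"
}
```

e.g. `long_pauses /var/log/tomcat7/gc.log` (log path assumed from the -Xloggc flag I would use).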
