lucene-dev mailing list archives
From "Gopal Patwa (JIRA)" <>
Subject [jira] [Commented] (SOLR-5961) Solr gets crazy on /overseer/queue state change
Date Fri, 06 Feb 2015 06:51:35 GMT


Gopal Patwa commented on SOLR-5961:

We also had a similar problem today in our production system, as Ugo mentioned. It was caused
by rebooting the machines for ZooKeeper (5 nodes) and an 8-node SolrCloud cluster (single shard)
to install a Unix security patch.

JDK 7, Solr 4.10.3, CentOS

After the reboot, we saw a huge number of messages in /overseer/queue:

./ -server localhost:2181 ls /search/catalog/overseer/queue  | sed 's/,/\n/g' | wc
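The command above (its binary name was lost in the archive) pipes zkCli's "ls" output through sed and wc to count queue entries. A minimal sketch of just the counting pipeline, run here against a canned sample of the bracketed child list (the znode names are hypothetical, for illustration only):

```shell
# "ls" on a znode prints one bracketed, comma-separated child list.
# Sample stand-in for real output (entry names made up):
sample='[qn-0000000001, qn-0000000002, qn-0000000003]'

# Split on commas so each child lands on its own line, then count lines.
echo "$sample" | sed 's/,/\n/g' | wc -l   # prints 3 with GNU sed
```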

We have a very small cluster (8 nodes); how could /overseer/queue hold 17k+ messages? Because
of this, the leader node took a few hours to come back from recovery.

Logs from ZooKeeper:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/queue
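A ConnectionLoss on a plain "ls" of a node with tens of thousands of children is often just the reply outrunning ZooKeeper's default 1 MB jute.maxbuffer limit. Raising it on both the server and the client JVM (a guess at the cause here, not something this thread confirms) lets the queue be inspected and cleaned; zkServer.sh and zkCli.sh both pick the setting up from JVMFLAGS:

```shell
# Assumption: the child list of /overseer/queue exceeds ZooKeeper's default
# 1 MB response buffer. jute.maxbuffer must match on server and client JVMs.
export JVMFLAGS="-Djute.maxbuffer=10485760"   # 10 MB, read by zkServer.sh / zkCli.sh
```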

> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>                 Key: SOLR-5961
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate
>            Reporter: Maxim Novikov
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
> No idea how to reproduce it, but sometimes Solr starts littering the log with the following:
> 419158 [localhost-startStop-1-EventThread] INFO  ? LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  ? Update state numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It continues spamming these messages with no delay, and restarting all the nodes does not
> help. I have even tried stopping every node in the cluster first, but as soon as I start
> one, the behavior is the same: it goes crazy with this "/overseer/queue state" spam.
> PS: The only way to handle this was to stop everything, manually clean up all the
> Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can
> imagine, that is unbearable in a production environment.
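The manual ZooKeeper cleanup described in the PS can be scripted. A hedged sketch, assuming ZooKeeper's own zkCli.sh is on the PATH and Solr's state lives at the ZooKeeper root (no chroot); "rmr" is the recursive delete in ZooKeeper 3.4.x (renamed "deleteall" in later releases), and every Solr node must be stopped before running it:

```shell
# Assumption: ZooKeeper 3.4.x zkCli.sh; Solr data at the root chroot.
# Stop all Solr nodes first, then restart the cluster after the delete.
zkCli.sh -server localhost:2181 rmr /overseer/queue
```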

This message was sent by Atlassian JIRA

