lucene-dev mailing list archives

From "Gopal Patwa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5961) Solr gets crazy on /overseer/queue state change
Date Fri, 06 Feb 2015 06:51:35 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308716#comment-14308716 ]

Gopal Patwa commented on SOLR-5961:
-----------------------------------

We also hit a similar problem today in our production system, as Ugo mentioned. It was
triggered by rebooting the machines for ZooKeeper (5 nodes) and SolrCloud (8 nodes, single
shard) to install some Unix security patches.

JDK 7, Solr 4.10.3, CentOS

After the reboot, we saw a huge number of messages in overseer/queue:

./zkCli.sh -server localhost:2181 ls /search/catalog/overseer/queue | sed 's/,/\n/g' | wc -l
178587
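
The pipeline above counts every line that zkCli.sh prints, so connection and log lines
may inflate the number slightly. A minimal sketch of a stricter count follows, assuming
the child list is printed on a single line of the form "[a, b, c]";
count_queue_children is a hypothetical helper name, not part of ZooKeeper or Solr.

```shell
# Sketch: count the children in the bracketed list printed by zkCli.sh "ls".
# Assumption: the list appears on one line, e.g. "[qn-0000000001, qn-0000000002]".
# count_queue_children is a hypothetical helper, not a ZooKeeper command.
count_queue_children() {
  # Keep only lines that look like "[...]", strip the brackets,
  # then count the comma-separated fields of the last such line.
  grep '^\[.*]$' \
    | sed 's/^\[//; s/]$//' \
    | awk -F',' '{ n = NF } END { print n + 0 }'
}
```

It could be used as:
./zkCli.sh -server localhost:2181 ls /search/catalog/overseer/queue | count_queue_children
An empty queue ("[]") or output without a bracketed list yields 0.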

We have a very small cluster (8 nodes), so how can overseer/queue hold 178k+ messages?
Because of this, the leader node took a few hours to come out of recovery.

Logs from ZooKeeper:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/queue


> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate machines)
>            Reporter: Maxim Novikov
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with the following messages:
> 419158 [localhost-startStop-1-EventThread] INFO  org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  – Update state numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It keeps spamming these messages with no delay, and restarting all the nodes does not help. I have even tried stopping all the nodes in the cluster first, but when I start one, the behavior doesn't change: it goes crazy with this "/overseer/queue state" again.
> PS The only way to handle this was to stop everything, manually clean up all the Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can understand, this is unbearable in a production environment.



