lucene-dev mailing list archives

From "Gopal Patwa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5961) Solr gets crazy on /overseer/queue state change
Date Mon, 09 Feb 2015 01:04:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311681#comment-14311681 ]

Gopal Patwa commented on SOLR-5961:
-----------------------------------

Thanks Mark. Here are more details on our production issue; I will try to reproduce it.

Restart sequence:
Solr - restarted 02/03/2015 (8 nodes, 10 collections)
ZooKeeper - restarted 02/04/2015 (5 nodes)

Normal index size is approx. 5 GB. Only a few nodes had this issue.

While a replica was in recovery, its transaction logs grew to over 100 GB. A possible reason
is that during this period the replica keeps writing every update sent by the leader to the
transaction log.
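
To confirm how large the transaction logs have grown on an affected node, one quick check is
to sum the file sizes under the core's tlog directory. This is only a minimal sketch; the path
below is an assumption and needs to be replaced with the actual data directory of the
recovering core:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;

public class TlogSize {
    public static void main(String[] args) throws IOException {
        // Hypothetical path: replace with the real data dir of the recovering core.
        Path tlogDir = Paths.get("/opt/solr/city_shard1_replica2/data/tlog");
        AtomicLong total = new AtomicLong();
        // Sum the size of every file under the tlog directory.
        Files.walk(tlogDir)
             .filter(Files::isRegularFile)
             .forEach(p -> {
                 try {
                     total.addAndGet(Files.size(p));
                 } catch (IOException e) {
                     throw new UncheckedIOException(e);
                 }
             });
        System.out.printf("tlog size: %.1f GB%n", total.get() / (1024.0 * 1024 * 1024));
    }
}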

Because the overseer queue grew so large, the Admin UI Cloud tree view hangs; this may be
similar to the JIRA issue below:
https://issues.apache.org/jira/browse/SOLR-6395
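
For reference, the backlog can be quantified by counting the children of /overseer/queue with
a plain ZooKeeper client. A minimal sketch, assuming the ensemble is reachable at
localhost:2181 (adjust to the real zkHost):

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // zkHost is an assumption; point it at the real ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // Each queued state-change message is a child znode under /overseer/queue.
        List<String> items = zk.getChildren("/overseer/queue", false);
        System.out.println("/overseer/queue depth: " + items.size());
        zk.close();
    }
}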

Exceptions during this time:

Zookeeper Log:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /overseer/queue

Solr Log:

2015-02-05 23:23:13,174 [] priority=ERROR app_name= thread=RecoveryThread location=RecoveryStrategy
line=142 Error while trying to recover. core=city_shard1_replica2:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait
on state recovering for shard1 in city on srwp01usc002.stubprod.com:8080_solr but I still
do not see the requested state. I see state: active live:true leader from ZK: http://srwp01usc001.stubprod.com:8080/solr/city_shard1_replica1/
 at java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.util.concurrent.FutureTask.get(FutureTask.java:188)
 at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615)
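
The mismatch in that message (the leader is asked to wait until it sees the replica in the
"recovering" state, but ZK still reports "active") suggests the replica's state-change message
is still sitting in the overseer queue and has not been applied. On Solr 4.x the shared state
lives in the single /clusterstate.json znode, so dumping that znode shows which replica states
have actually been published; a minimal sketch, again assuming localhost:2181:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PrintClusterState {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // zkHost is an assumption; use the ensemble this cluster is registered in.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // On Solr 4.x the full cluster state is kept in /clusterstate.json;
        // printing it shows the per-replica states the leader is acting on.
        byte[] data = zk.getData("/clusterstate.json", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}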

> Solr gets crazy on /overseer/queue state change
> -----------------------------------------------
>
>                 Key: SOLR-5961
>                 URL: https://issues.apache.org/jira/browse/SOLR-5961
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>         Environment: CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate
machines)
>            Reporter: Maxim Novikov
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>
> No idea how to reproduce it, but sometimes Solr starts littering the log with the following
messages:
> 419158 [localhost-startStop-1-EventThread] INFO  org.apache.solr.cloud.DistributedQueue
 ? LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
> 419190 [Thread-3] INFO  org.apache.solr.cloud.Overseer  ? Update state numShards=1 message={
>   "operation":"state",
>   "state":"recovering",
>   "base_url":"http://${IP_ADDRESS}/solr",
>   "core":"${CORE_NAME}",
>   "roles":null,
>   "node_name":"${NODE_NAME}_solr",
>   "shard":"shard1",
>   "collection":"${COLLECTION_NAME}",
>   "numShards":"1",
>   "core_node_name":"core_node2"}
> It continues spamming these messages with no delay, and restarting all the nodes
does not help. I have even tried to stop all the nodes in the cluster first, but then when
I start one, the behavior doesn't change; it goes crazy with this "/overseer/queue state"
again.
> PS The only way to handle this was to stop everything, manually clean up all the data
in ZooKeeper related to Solr, and then rebuild everything from scratch. As you can imagine,
this is kind of unbearable in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


