lucene-dev mailing list archives

From "Cassandra Targett (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy
Date Fri, 27 Jan 2017 19:59:24 GMT

     [ https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cassandra Targett closed SOLR-6056.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 5.0

Per Ishan's last comment, it seems part of this issue was committed, and the other part was
fixed in SOLR-8371.

> Zookeeper crash JVM stack OOM because of recover strategy 
> ----------------------------------------------------------
>
>                 Key: SOLR-6056
>                 URL: https://issues.apache.org/jira/browse/SOLR-6056
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.6
>         Environment: Two linux servers, 65G memory, 16 core cpu
> 20 collections, every collection has one shard two replica 
> one zookeeper
>            Reporter: Raintung Li
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>              Labels: cluster, crash, recover
>             Fix For: 5.0
>
>         Attachments: patch-6056.txt
>
>
> Errors such as "org.apache.solr.common.SolrException: Error opening new searcher. exceeded
> limit of maxWarmingSearchers=2, try again later", when they occur in
> DistributedUpdateProcessor, trigger the core admin recovery process.
> That means every update request that hits this error sends a core admin recovery request
> (see doFinish() in DistributedUpdateProcessor.java).
> The terrible thing is that CoreAdminHandler starts a new thread for each request to publish
> the recovery status and start recovery. The thread count increases very quickly and the JVM
> hits a stack OOM. The Overseer can't handle that many status updates: the ZooKeeper nodes
> under /overseer/queue (for example /overseer/queue/qn-0000125553) grew to more than
> 40 thousand in two minutes.
> In the end, ZooKeeper crashed.
> Worse still, with so many nodes queued in ZooKeeper, the cluster couldn't publish the
> correct status because only one Overseer does the work. I had to start three threads to
> clear the queue nodes. The cluster didn't work normally for nearly 30 minutes...
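
The failure mode described above (one new thread per recovery request, with no deduplication)
can be illustrated with a minimal sketch. This is not the patch committed for SOLR-6056 or
SOLR-8371; RecoveryGate, requestRecovery and the core name "core1" are hypothetical names
chosen for illustration only. The sketch assumes that a repeated recovery request for a core
that is already recovering can be dropped, and that recovery work can run on a small fixed
thread pool:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not Solr code): at most one in-flight recovery per core. */
class RecoveryGate {
    // Names of cores whose recovery is queued or running.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    // Bounded pool, so a burst of requests cannot create unbounded threads.
    private final ExecutorService pool;

    RecoveryGate(int maxThreads) {
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Returns true if the recovery was scheduled, false if one is already in flight. */
    boolean requestRecovery(String coreName, Runnable recovery) {
        if (!inFlight.add(coreName)) {
            return false; // duplicate request: drop it instead of starting a thread
        }
        try {
            pool.execute(() -> {
                try {
                    recovery.run();
                } finally {
                    inFlight.remove(coreName); // allow a later recovery for this core
                }
            });
        } catch (RuntimeException e) {
            inFlight.remove(coreName); // scheduling failed, so nothing is in flight
            throw e;
        }
        return true;
    }

    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        RecoveryGate gate = new RecoveryGate(2);
        CountDownLatch hold = new CountDownLatch(1);
        Runnable slowRecovery = () -> {
            try {
                hold.await(); // simulate a long-running recovery
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int accepted = 0;
        for (int i = 0; i < 1000; i++) { // simulate a burst of failing updates
            if (gate.requestRecovery("core1", slowRecovery)) {
                accepted++;
            }
        }
        System.out.println("accepted=" + accepted); // prints "accepted=1"
        hold.countDown();
        gate.shutdown();
    }
}
```

With this kind of gate, a burst of failing updates against one core results in a single
recovery task rather than one thread (and one Overseer status update) per failed request.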



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

