lucene-dev mailing list archives

From "Ramkumar Aiyengar (JIRA)" <>
Subject [jira] [Commented] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy
Date Sat, 14 Jun 2014 15:25:02 GMT


Ramkumar Aiyengar commented on SOLR-6056:

SOLR-5593 was one more place where we debated taking out the quick publish, but we decided
against it because doing so increases the chance of an unhealthy replica becoming the leader and
losing updates. That argument probably still holds. If the leader is sending tons of recovery
requests, couldn't the leader decline to send one, or the replica decline to accept one, while a
recovery is already in progress?
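
To make the suggestion concrete, here is a minimal sketch of the replica-side variant: a request
is rejected while a recovery for the same core is already running. RecoveryGuard, tryRecover and
the core names are hypothetical and are not part of Solr. The recovery itself is passed in as a
plain Runnable and runs on the calling thread, which is a simplification for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; this is not Solr code.
final class RecoveryGuard {
    // One flag per core name; true while a recovery for that core is running.
    private final ConcurrentMap<String, AtomicBoolean> inProgress = new ConcurrentHashMap<>();

    /**
     * Runs the given recovery unless one is already in progress for the core.
     * Returns true if the recovery ran, false if the request was skipped.
     */
    boolean tryRecover(String coreName, Runnable recovery) {
        AtomicBoolean flag = inProgress.computeIfAbsent(coreName, k -> new AtomicBoolean(false));
        // Only the caller that flips the flag from false to true may proceed.
        if (!flag.compareAndSet(false, true)) {
            return false;
        }
        try {
            recovery.run();
            return true;
        } finally {
            // Always clear the flag so that later requests are accepted again.
            flag.set(false);
        }
    }
}
```

With a guard like this, a burst of recovery requests for one core would result in at most one
running recovery instead of one new thread per request. Whether skipping a request is safe in
every case (for example, when the later request reflects a newer failure) is not addressed here.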

> Zookeeper crash JVM stack OOM because of recover strategy 
> ----------------------------------------------------------
>                 Key: SOLR-6056
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.6
>         Environment: Two Linux servers, 65G memory, 16-core CPU
> 20 collections, each with one shard and two replicas
> one ZooKeeper
>            Reporter: Raintung Li
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>              Labels: cluster, crash, recover
>         Attachments: patch-6056.txt
> Some errors, such as "org.apache.solr.common.SolrException: Error opening new searcher. exceeded
> limit of maxWarmingSearchers=2, try again later", when they occur in DistributedUpdateProcessor,
> trigger the core admin recovery process.
> That means every update request sends a core admin recovery request
> (see the code in doFinish()).
> The terrible thing is that CoreAdminHandler starts a new thread to publish the recovery
> status and start recovery. The thread count increases very quickly, leading to a stack OOM. The
> Overseer can't handle that many status updates: the number of ZooKeeper nodes under
> /overseer/queue (such as qn-0000125553) grew by more than 40 thousand in two minutes.
> In the end, ZooKeeper crashed.
> Worse, with so many nodes queued in ZooKeeper, the cluster can't publish
> the right status because only one Overseer is working. I had to start three threads to clear the
> queue nodes. The cluster did not work normally for nearly 30 minutes...
