lucene-solr-user mailing list archives

From Shawn Heisey
Subject Re: Auto recovery of a failed Solr Cloud Node?
Date Thu, 27 Sep 2018 14:12:19 GMT
On 9/27/2018 8:00 AM, Shawn Heisey wrote:
> On 9/27/2018 7:24 AM, Kimber, Mike wrote:
>> I'm trying to determine if there is any health check available to 
>> determine the above and then if the issue happens then an automated 
>> mechanism in SolrCloud to restart the instance. Or is this something 
>> we have to code ourselves?
> As shipped by the project, Solr will never restart itself 
> automatically.  If it dies, it's dead until you start it again, unless 
> you implement something to restart it automatically.  This is 
> intentional -- Solr almost never dies unless there's some kind of 
> problem -- not enough memory, corrupt software, etc.  If Solr *does* 
> die, you need to figure out why and fix it, not rely on an automatic 
> restart. 

Replying to myself.  Probably a sign of insanity!

The other side of that coin is a completely unresponsive server.  Here's 
the thing about that situation:  If it's really unresponsive, it 
probably wouldn't be possible to send Solr a message to tell it to 
restart itself.  When a server in SolrCloud becomes unresponsive, 
SolrCloud will attempt to have it do an index recovery, but this does 
NOT involve a restart.  Solr cannot restart itself automatically.  It 
might be possible to write that functionality into Solr, but I think 
that using such functionality for automatic restarts on problem 
detection is a very bad idea.  The root of the problem must be found and 
fixed; a restart probably isn't going to get rid of it.

If a SolrCloud server remains unresponsive, then any recovery operation 
that is initiated is going to fail.  Typically, problems that lead to an 
unresponsive server are not the kind of problems that will go away 
without action by the administrator -- adding memory, reducing the index 
size, etc.  If the admin restarts the server to clear that kind of 
problem, it's very likely that the problem will happen again.
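
On the health check half of the question: detecting an unresponsive 
node has to happen from outside Solr, for the reason given above.  
Here is a minimal sketch of such a probe, not a definitive 
implementation.  It assumes the node's base URL is something like 
http://localhost:8983/solr and that the system info handler at 
/admin/info/system answers with a responseHeader status of 0 when the 
node is healthy.  The function name, the 5 second timeout, and the 
"opener" parameter are made up for illustration.

```python
import json
import urllib.error
import urllib.request


def solr_is_responsive(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the node answers a system info request in time.

    base_url is assumed to look like "http://localhost:8983/solr".
    opener is a hypothetical hook so the HTTP call can be swapped out
    (for example in tests); it defaults to urllib's urlopen.
    """
    url = base_url.rstrip("/") + "/admin/info/system?wt=json"
    try:
        with opener(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply:
        # all are treated as "not responsive".
        return False
    if not isinstance(body, dict):
        return False
    header = body.get("responseHeader")
    # Assumption: a healthy node reports status 0 in its response header.
    return isinstance(header, dict) and header.get("status") == 0
```

A monitoring job could call solr_is_responsive() on a schedule and 
alert an administrator when it returns False several times in a row.  
Wiring the result to an automatic restart is possible, but for the 
reasons above I would treat a failed check as a prompt to investigate, 
not as a trigger for a restart.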

