lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
Date Wed, 05 Nov 2014 19:56:33 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198963#comment-14198963 ]

Erick Erickson commented on SOLR-6707:
--------------------------------------

Hmm, thanks for the writeup.

bq: Solr experienced an "internal server error" I believe due in part to a fairly new feature
we are using, which seemingly caused all cores to go down

Anything in particular about the new feature we should know about? Or does this happen with
any generic internal server error?



> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server down, and
> this is what we found after investigating:
> - We were running a Solr instance with two separate cores, one of which is perpetually
>   down because its configs are not yet completely updated for SolrCloud. This was thought
>   to be harmless since it's not currently in use.
> - Solr experienced an "internal server error", I believe due in part to a fairly new
>   feature we are using, which seemingly caused all cores to go down.
> - Solr immediately went into recovery, followed by leader election for each shard of
>   each core.
> - Our primary core recovered immediately. Our additional core, which was never active
>   in the first place, attempted to recover but of course couldn't due to the improper
>   configs.
> - Solr then began rapid-fire reattempting recovery of said core, trying maybe 20-30 times
>   per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue became so backed up that normal cluster coordination
>   could no longer play out, and Solr toppled over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around,
> and our quick solution has been to remove said core. However, I can see other potential
> scenarios that might cause the same issue to arise.
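
The failure mode described above (20-30 recovery attempts per second, each one adding work to /overseer/queue) is the kind of thing a capped exponential backoff between attempts is usually meant to prevent. The following is only a minimal sketch of that general technique, not Solr's recovery code and not the fix adopted for this issue. The class name RecoveryBackoff, the method delayMillis, and the 1 second / 60 second values are all hypothetical.

```java
/**
 * Hypothetical helper, not part of Solr: computes how long to wait before
 * the next recovery attempt, doubling the delay each time up to a cap.
 */
public final class RecoveryBackoff {
    private final long baseMillis; // delay before the first retry
    private final long maxMillis;  // upper bound on any delay

    public RecoveryBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Delay before retry number {@code attempt} (0-based), capped at maxMillis. */
    public long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Double the delay, guarding against overflow and the cap.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }

    public static void main(String[] args) {
        // Assumed values for illustration: start at 1 s, never wait more than 60 s.
        RecoveryBackoff backoff = new RecoveryBackoff(1_000, 60_000);
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + backoff.delayMillis(attempt) + " ms");
        }
    }
}
```

With a schedule like this, a core that can never recover would settle at roughly one attempt per minute instead of dozens per second. Adding random jitter to each delay is a common refinement when many cores retry at once.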



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


