I'm looking to see if there is any documentation describing
situations which could trigger a shard replica to go into recovery.
We're having issues with some of our replicas randomly going into
recovery mode. I know of some conditions that can cause this: for
example, the node loses connectivity to ZooKeeper (i.e., timeouts)
and clusterstate.json is updated to show the node as down, or hard
commits are issued for every query (we're not doing this), which can
overtax the server and also cause timeouts.
We would like to get to the bottom of why these nodes are being
requested to recover and by whom (i.e., Zookeeper / Overseer or the
The bigger reason for this is that some of our datasets can take
days to recover (this is due to the types of queries being issued
and the amount of continuous ingress traffic; these queries are
currently being addressed and optimized by our engineering team).
Until then, I'd like to find a way to prevent these nodes from going
into recovery mode in the first place.
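
To narrow down which replicas are affected at a given moment, something
like the following could be used. It is only a sketch: it assumes the
JSON returned by the Collections API CLUSTERSTATUS action (for example
/solr/admin/collections?action=CLUSTERSTATUS&wt=json) has already been
fetched and parsed, and the collection, replica, and host names in the
sample are hypothetical. It reports every replica whose state is not
"active"; it does not say who requested the recovery, which still has to
come from the Solr logs on the node and on the shard leader.

```python
from typing import Any, Dict, List, Tuple

# (collection, shard, replica, node_name, state)
ReplicaRow = Tuple[str, str, str, str, str]


def non_active_replicas(cluster_status: Dict[str, Any]) -> List[ReplicaRow]:
    """List every replica whose state is anything other than 'active'.

    Expects the parsed CLUSTERSTATUS response, i.e. a dict shaped like
    cluster -> collections -> <name> -> shards -> <shard> -> replicas.
    """
    rows: List[ReplicaRow] = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in sorted(collections.items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            for rep_name, rep in sorted(shard.get("replicas", {}).items()):
                state = rep.get("state", "unknown")
                if state != "active":
                    node = rep.get("node_name", "unknown")
                    rows.append((coll_name, shard_name, rep_name, node, state))
    return rows


# Hypothetical sample, trimmed to the fields used above.
sample = {
    "cluster": {
        "collections": {
            "logs": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {
                                "state": "active",
                                "node_name": "host1:8983_solr",
                                "leader": "true",
                            },
                            "core_node2": {
                                "state": "recovering",
                                "node_name": "host2:8983_solr",
                            },
                        }
                    }
                }
            }
        }
    }
}

for coll, shard, replica, node, state in non_active_replicas(sample):
    print(f"{coll}/{shard}/{replica} on {node}: {state}")
```

Running a check like this on a schedule and recording the timestamps would
at least give a timeline to line up against the ZooKeeper and Solr logs.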
Please let me know if there are any docs that describe recovery
scenarios or troubleshooting, or if any of you have experience with
this situation. Any help is greatly appreciated.