I'm looking for documentation describing the situations that can trigger a shard replica to go into recovery. Some of our replicas are randomly going into recovery mode. I know of a few conditions that can cause this, such as a node losing connectivity to ZooKeeper (e.g., timeouts) so that clusterstate.json is updated to show the node as down, or issuing a hard commit with every query (we're not doing this), which can overtax the server and also cause timeouts.

We would like to get to the bottom of why these nodes are being asked to recover, and by whom (i.e., ZooKeeper / the Overseer or the shard leader).
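To make the question concrete, here is a minimal sketch of how replica states could be polled so that the moment a replica leaves the "active" state can be lined up against the Solr and ZooKeeper logs. It assumes a Solr version that supports the Collections API CLUSTERSTATUS action and the usual cluster/collections/shards/replicas layout of its JSON response. The function names and the base URL are hypothetical, and this only shows which replicas are recovering, not who requested the recovery.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch the CLUSTERSTATUS response as a dict.

    base_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def non_active_replicas(status):
    """Return (collection, shard, replica, node, state) for every replica
    whose state is anything other than "active"."""
    found = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    found.append((coll_name, shard_name, replica_name,
                                  replica.get("node_name"), state))
    return found
```

Running non_active_replicas(fetch_cluster_status(...)) on a timer and logging the result with a timestamp would give a timeline to compare against the leader's and the affected node's logs.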

The main reason this matters is that some of our datasets can take days to recover. This is due to the types of queries being issued and the amount of continuous ingress traffic; our engineering team is currently addressing and optimizing these queries. Until then, I'd like to find a way to prevent these nodes from going into recovery mode in the first place.

Please let me know if there are any docs that describe recovery scenarios or troubleshooting, or if any of you have experience with this situation. Any help is greatly appreciated.



Brian Wright
Sr. Systems Engineer
901 Mariners Island Blvd Suite 200
San Mateo, CA 94404 USA
Email  brianw@marketo.com
Phone +1.650.539.3530
