Thanks for your responses, Shawn and Erick.
Some clarification questions, but first a description of my (non-standard) use case:
My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: In order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure:
To scale down:
1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
2) Shut down all the nodes
To scale back up:
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back
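For reference, here is a rough Python sketch of how the Collections API calls in that procedure could be built. It is only an illustration, not my actual automation: the base URL, the NFS path, the backup naming scheme, and the helper names are all placeholders I made up. The parameters (name, collection, location, replicationFactor, maxShardsPerNode) are the ones the BACKUP and RESTORE actions accept, and the live-node check reads the cluster.live_nodes list from a CLUSTERSTATUS response.

```python
from urllib.parse import urlencode

# Assumptions for illustration only: adjust to the real cluster.
SOLR_BASE = "http://localhost:8983/solr"
BACKUP_DIR = "/mnt/solr_backups"  # hypothetical shared NFS mount


def collections_api_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, "wt": "json", **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def backup_url(collection):
    """URL for backing up one collection to the shared volume."""
    return collections_api_url(
        "BACKUP",
        name=f"{collection}_backup",  # hypothetical naming scheme
        collection=collection,
        location=BACKUP_DIR,
    )


def restore_url(collection, replication_factor=3, max_shards_per_node=1):
    """URL for restoring one collection with explicit replica settings."""
    return collections_api_url(
        "RESTORE",
        name=f"{collection}_backup",
        collection=collection,
        location=BACKUP_DIR,
        replicationFactor=replication_factor,
        maxShardsPerNode=max_shards_per_node,
    )


def enough_live_nodes(clusterstatus, expected=3):
    """True once a parsed CLUSTERSTATUS response lists enough live nodes."""
    return len(clusterstatus["cluster"]["live_nodes"]) >= expected
```

The URLs would then be fetched with any HTTP client, one collection at a time, with RESTORE issued only after enough_live_nodes() returns True.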
This has been working very well, for the most part, with the following complications/observations:
1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json).
2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do not currently have any replication stuff configured (as it seems I should not).
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live and synced and happy, and because I was accessing solr through a round-robin load balancer, I was never able to tell which node was out of sync.
If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one. But the fact that this happened (and the way it happened) is making me wonder whether, and how, I can run this automated staging environment scaling reliably, with confidence that it will Just Work™.
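The node-by-node check I have in mind would look something like the sketch below. It assumes one shard per collection with a replica on every node, so each node's local document count should match; the node URLs and helper names are placeholders. It bypasses the load balancer and asks each node for its local count with distrib=false, then reports any node that disagrees with the majority.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen


def local_count_url(node, collection):
    """URL that asks one node for its local (non-distributed) doc count."""
    query = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})
    return f"{node}/solr/{collection}/select?{query}"


def fetch_local_count(node, collection):
    """Query a single node directly and return its local numFound."""
    with urlopen(local_count_url(node, collection), timeout=10) as resp:
        return json.load(resp)["response"]["numFound"]


def odd_nodes_out(counts):
    """Return the nodes whose count differs from the most common count."""
    majority, _ = Counter(counts.values()).most_common(1)[0]
    return [node for node, n in counts.items() if n != majority]


def check_collection(nodes, collection):
    """Collect per-node counts and list the nodes that look out of sync."""
    counts = {node: fetch_local_count(node, collection) for node in nodes}
    return odd_nodes_out(counts)
```

Called as check_collection(["http://node1:8983", "http://node2:8983", "http://node3:8983"], "mycollection") (hypothetical hosts and collection name), an empty list would mean all three nodes agree.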
Comments and suggestions would be GREATLY appreciated.