Thanks for your responses, Shawn and Erick.

Some clarification questions, but first a description of my (non-standard) use case:

My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working well so far on the production cluster (knock wood); it's the staging cluster that's giving me fits. Here's why: in order to save money, I have the AWS auto-scaler scale the cluster down to zero nodes when it's not in use. Here's the (automated) procedure, with a rough sketch of the API calls after the lists:

SCALE DOWN
1) Call admin/collections?action=BACKUP for each collection to a shared NFS volume
2) Shut down all the nodes

SCALE UP
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back
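
For reference, the automation issues the Collections API calls roughly like this (the hostname, collection names, backup names, and NFS path below are placeholders, not my real values):

    import requests

    SOLR = "http://solr-staging:8983/solr"        # placeholder node address
    COLLECTIONS = ["collection1", "collection2"]  # placeholder collection names
    BACKUP_LOCATION = "/mnt/solr_backups"         # the shared NFS mount

    def backup_all():
        # SCALE DOWN, step 1: snapshot each collection to the NFS volume
        for coll in COLLECTIONS:
            resp = requests.get(SOLR + "/admin/collections", params={
                "action": "BACKUP",
                "name": coll + "_backup",
                "collection": coll,
                "location": BACKUP_LOCATION,
                "wt": "json",
            })
            resp.raise_for_status()  # only catches HTTP-level failures

    def restore_all():
        # SCALE UP, step 3: recreate each collection from its backup
        # (replicationFactor / maxShardsPerNode per observations 2 and 3 below)
        for coll in COLLECTIONS:
            resp = requests.get(SOLR + "/admin/collections", params={
                "action": "RESTORE",
                "name": coll + "_backup",
                "collection": coll,
                "location": BACKUP_LOCATION,
                "replicationFactor": 3,
                "maxShardsPerNode": 1,
                "wt": "json",
            })
            resp.raise_for_status()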

This has been working very well, for the most part, with the following complications/observations:

1) If I don't optimize each collection right before BACKUP, the backup fails (see the attached solr_backup_error.json); the optimize call I'm using is sketched after this list.
2) If I don't specify a replicationFactor during RESTORE, the admin interface's Cloud diagram only shows one active node per collection. Is this expected? Am I required to specify the replicationFactor unless I'm using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on CREATE/RESTORE, I do not currently have any replication configuration in place (nothing on the /replication requestHandler), as it seems I should not.
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to think all replicas were live, in sync, and happy, and because I was accessing Solr through a round-robin load balancer, I was never able to tell which node was out of sync.
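
Regarding observation 1: the pre-BACKUP optimize I'm currently issuing is essentially the following (host and collection name are placeholders again), even though I take Erick's point below that optimize shouldn't normally be needed:

    import requests

    SOLR = "http://solr-staging:8983/solr"   # placeholder node address

    def optimize(collection):
        # Force-merge the collection's index right before calling BACKUP;
        # waitSearcher=true blocks until a searcher on the merged index is open.
        resp = requests.get(SOLR + "/" + collection + "/update",
                            params={"optimize": "true",
                                    "waitSearcher": "true",
                                    "wt": "json"})
        resp.raise_for_status()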

If it happens again, I'll make node-by-node requests and try to figure out what's different about the failing one. But the fact that this happened (and the way it happened) makes me wonder whether (and how) I can make this automated staging-environment scaling reliable, with confidence that it will Just Work™.
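
Something like this is what I have in mind for the node-by-node check (the node addresses and collection name are placeholders; distrib=false should keep each node from fanning the query back out to the other replicas):

    import requests

    NODES = ["http://solr-1:8983/solr",   # placeholder node addresses
             "http://solr-2:8983/solr",
             "http://solr-3:8983/solr"]

    def compare_nodes(collection, doc_id):
        # Query each node directly, bypassing the load balancer; with
        # distrib=false each node answers only from its own local core,
        # so a stale replica should show up as a different numFound.
        for node in NODES:
            resp = requests.get(node + "/" + collection + "/select",
                                params={"q": "id:" + doc_id,
                                        "distrib": "false",
                                        "wt": "json"})
            found = resp.json()["response"]["numFound"]
            print(node, "numFound =", found)

    # e.g. compare_nodes("my_collection", "hd76s004z")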

Comments and suggestions would be GREATLY appreciated.

Michael



On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerickson@gmail.com> wrote:
And please do not use optimize unless your index is
totally static. I only recommend it when the pattern is
to update the index periodically, like every day or
something and not update any docs in between times.

Implied in Shawn's e-mail was that you should undo
anything you've done in terms of configuring replication;
just go with the defaults.

Finally, my bet is that your problematic Solr node is misconfigured.

Best,
Erick

On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apache@elyograg.org> wrote:
> On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
>> seems to be working OK, except that one of the nodes never seems to get its
>> replica updated.
>>
>> Queries take place through a non-caching, round-robin load balancer. The
>> collection looks fine, with one shard and a replicationFactor of 3.
>> Everything in the cloud diagram is green.
>>
>> But if I (for example) select?q=id:hd76s004z, the results come up empty 1
>> out of every 3 times.
>>
>> Even several minutes after a commit and optimize, one replica still isn’t
>> returning the right info.
>>
>> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
>> the `/replication` requestHandler, or is that a non-solrcloud,
>> standalone-replication thing?
>
> This is one of the more confusing aspects of SolrCloud.
>
> When everything is working perfectly in a SolrCloud install, the feature
> in Solr called "replication" is *never* used.  SolrCloud does require
> the replication feature, though ... which is what makes this whole thing
> very confusing.
>
> Replication is used to replicate an entire Lucene index (consisting of a
> bunch of files on the disk) from a core on a master server to a core on
> a slave server.  This is how replication was done before SolrCloud was
> created.
>
> The way that SolrCloud keeps replicas in sync is *entirely* different.
> SolrCloud has no masters and no slaves.  When you index or delete a
> document in a SolrCloud collection, the request is forwarded to the
> leader of the correct shard for that document.  The leader then sends a
> copy of that request to all the other replicas, and each replica
> (including the leader) independently handles the updates that are in the
> request.  Since all replicas index the same content, they stay in sync.
>
> What SolrCloud does with the replication feature is index recovery.  In
> some situations recovery can be done from the leader's transaction log,
> but when a replica has gotten so far out of sync that the only option
> available is to completely replace the index on the bad replica,
> SolrCloud will fire up the replication feature and create an exact copy
> of the index from the replica that is currently elected as leader.
> SolrCloud temporarily designates the leader core as master and the bad
> replica as slave, then initiates a one-time replication.  This is all
> completely automated and requires no configuration or input from the
> administrator.
>
> The configuration elements you have asked about are for the old
> master-slave replication setup and do not apply to SolrCloud at all.
>
> What I would recommend that you do to solve your immediate issue:  Shut
> down the Solr instance that is having the problem, rename the "data"
> directory in the core that isn't working right to something else, and
> start Solr back up.  As long as you still have at least one good replica
> in the cloud, SolrCloud will see that the index data is gone and copy
> the index from the leader.  You could delete the data directory instead
> of renaming it, but that would leave you with no "undo" option.
>
> Thanks,
> Shawn
>