lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael B. Klein" <>
Subject Re: Replication Question
Date Wed, 02 Aug 2017 14:56:22 GMT
Thanks for your responses, Shawn and Erick.

Some clarification questions, but first a description of my (non-standard)
use case:

My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working
well so far on the production cluster (knock wood); its the staging cluster
that's giving me fits. Here's why: In order to save money, I have the AWS
auto-scaler scale the cluster down to zero nodes when it's not in use.
Here's the (automated) procedure:

1) Call admin/collections?action=BACKUP for each collection to a shared NFS
2) Shut down all the nodes

1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
3) Call admin/collections?action=RESTORE to put all the collections back

This has been working very well, for the most part, with the following

1) If I don't optimize each collection right before BACKUP, the backup
fails (see the attached solr_backup_error.json).
2) If I don't specify a replicationFactor during RESTORE, the admin
interface's Cloud diagram only shows one active node per collection. Is
this expected? Am I required to specify the replicationFactor unless I'm
using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
not currently have any replication stuff configured (as it seems I should
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
think all replicas were live and synced and happy, and because I was
accessing solr through a round-robin load balancer, I was never able to
tell which node was out of sync.

If it happens again, I'll make node-by-node requests and try to figure out
what's different about the failing one. But the fact that this happened
(and the way it happened) is making me wonder if/how I can automate this
automated staging environment scaling reliably and with confidence that it
will Just Work™.

Comments and suggestions would be GREATLY appreciated.


On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <>

> And please do not use optimize unless your index is
> totally static. I only recommend it when the pattern is
> to update the index periodically, like every day or
> something and not update any docs in between times.
> Implied in Shawn's e-mail was that you should undo
> anything you've done in terms of configuring replication,
> just go with the defaults.
> Finally, my bet is that your problematic Solr node is misconfigured.
> Best,
> Erick
> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <> wrote:
> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> >> seems to be working OK, except that one of the nodes never seems to get
> its
> >> replica updated.
> >>
> >> Queries take place through a non-caching, round-robin load balancer. The
> >> collection looks fine, with one shard and a replicationFactor of 3.
> >> Everything in the cloud diagram is green.
> >>
> >> But if I (for example) select?q=id:hd76s004z, the results come up empty
> 1
> >> out of every 3 times.
> >>
> >> Even several minutes after a commit and optimize, one replica still
> isn’t
> >> returning the right info.
> >>
> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
> options on
> >> the `/replication` requestHandler, or is that a non-solrcloud,
> >> standalone-replication thing?
> >
> > This is one of the more confusing aspects of SolrCloud.
> >
> > When everything is working perfectly in a SolrCloud install, the feature
> > in Solr called "replication" is *never* used.  SolrCloud does require
> > the replication feature, though ... which is what makes this whole thing
> > very confusing.
> >
> > Replication is used to replicate an entire Lucene index (consisting of a
> > bunch of files on the disk) from a core on a master server to a core on
> > a slave server.  This is how replication was done before SolrCloud was
> > created.
> >
> > The way that SolrCloud keeps replicas in sync is *entirely* different.
> > SolrCloud has no masters and no slaves.  When you index or delete a
> > document in a SolrCloud collection, the request is forwarded to the
> > leader of the correct shard for that document.  The leader then sends a
> > copy of that request to all the other replicas, and each replica
> > (including the leader) independently handles the updates that are in the
> > request.  Since all replicas index the same content, they stay in sync.
> >
> > What SolrCloud does with the replication feature is index recovery.  In
> > some situations recovery can be done from the leader's transaction log,
> > but when a replica has gotten so far out of sync that the only option
> > available is to completely replace the index on the bad replica,
> > SolrCloud will fire up the replication feature and create an exact copy
> > of the index from the replica that is currently elected as leader.
> > SolrCloud temporarily designates the leader core as master and the bad
> > replica as slave, then initiates a one-time replication.  This is all
> > completely automated and requires no configuration or input from the
> > administrator.
> >
> > The configuration elements you have asked about are for the old
> > master-slave replication setup and do not apply to SolrCloud at all.
> >
> > What I would recommend that you do to solve your immediate issue:  Shut
> > down the Solr instance that is having the problem, rename the "data"
> > directory in the core that isn't working right to something else, and
> > start Solr back up.  As long as you still have at least one good replica
> > in the cloud, SolrCloud will see that the index data is gone and copy
> > the index from the leader.  You could delete the data directory instead
> > of renaming it, but that would leave you with no "undo" option.
> >
> > Thanks,
> > Shawn
> >

View raw message