lucene-solr-user mailing list archives

From "Michael B. Klein" <mbkl...@gmail.com>
Subject Re: Replication Question
Date Wed, 02 Aug 2017 15:31:44 GMT
Another observation: After bringing the cluster back up just now, the
"1-in-3 nodes don't get the updates" issue persists, even with the cloud
diagram showing 3 nodes, all green.
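
Since it has happened again, I'll try the node-by-node requests I mention
below: bypass the load balancer and query each node with distributed search
disabled (distrib=false), so each response reflects only that node's local
index. A rough sketch of the check (hostnames and collection name are
placeholders for my setup):

  for host in solr-1 solr-2 solr-3; do
    echo "--- $host ---"
    curl -s "http://$host:8983/solr/my_collection/select?q=id:hd76s004z&distrib=false&rows=0&wt=json" \
      | grep -o '"numFound":[0-9]*'
  done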

On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mbklein@gmail.com> wrote:

> Thanks for your responses, Shawn and Erick.
>
> Some clarification questions, but first a description of my (non-standard)
> use case:
>
> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
> working well so far on the production cluster (knock wood); it's the staging
> cluster that's giving me fits. Here's why: In order to save money, I have
> the AWS auto-scaler scale the cluster down to zero nodes when it's not in
> use. Here's the (automated) procedure:
>
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a shared
> NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
> (the BACKUP and RESTORE calls are sketched below)
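>
> Roughly, with collection name, backup name, host, and NFS path as
> placeholders for my actual values:
>
>   # SCALE DOWN: optimize first (see observation 1 below), then back up
>   curl "http://solr-1:8983/solr/my_collection/update?optimize=true"
>   curl "http://solr-1:8983/solr/admin/collections?action=BACKUP&name=my_collection_bak&collection=my_collection&location=/mnt/nfs/solr_backups"
>
>   # SCALE UP: once live_nodes shows all 3 Solr nodes, restore
>   curl "http://solr-1:8983/solr/admin/collections?action=RESTORE&name=my_collection_bak&collection=my_collection&location=/mnt/nfs/solr_backups"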
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).
> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection. Is
> this expected? Am I required to specify the replicationFactor unless I'm
> using a shared HDFS volume for solr data?
> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
> message in the response, even though the restore seems to succeed (see the
> RESTORE sketch after this list).
> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
> not currently have any replication stuff configured (as it seems I should
> not).
> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
> diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
> think all replicas were live and synced and happy, and because I was
> accessing solr through a round-robin load balancer, I was never able to
> tell which node was out of sync.
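>
> Given observations 2 and 3, the only RESTORE variant that comes back clean
> for me is something like this (names and path are again placeholders):
>
>   curl "http://solr-1:8983/solr/admin/collections?action=RESTORE&name=my_collection_bak&collection=my_collection&location=/mnt/nfs/solr_backups&replicationFactor=3&maxShardsPerNode=1"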
>
> If it happens again, I'll make node-by-node requests and try to figure out
> what's different about the failing one. But the fact that this happened
> (and the way it happened) is making me wonder whether (and how) I can make
> this automated staging-environment scaling reliable, with confidence that
> it will Just Work™.
>
> Comments and suggestions would be GREATLY appreciated.
>
> Michael
>
>
>
> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> And please do not use optimize unless your index is
>> totally static. I only recommend it when the pattern is
>> to update the index periodically, say once a day, with
>> no document updates in between.
>>
>> Implied in Shawn's e-mail was that you should undo
>> anything you've done in terms of configuring replication;
>> just go with the defaults.
>>
>> Finally, my bet is that your problematic Solr node is misconfigured.
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <apache@elyograg.org> wrote:
>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
>> >> seems to be working OK, except that one of the nodes never seems to get
>> >> its replica updated.
>> >>
>> >> Queries take place through a non-caching, round-robin load balancer. The
>> >> collection looks fine, with one shard and a replicationFactor of 3.
>> >> Everything in the cloud diagram is green.
>> >>
>> >> But if I (for example) select?q=id:hd76s004z, the results come up empty
>> >> 1 out of every 3 times.
>> >>
>> >> Even several minutes after a commit and optimize, one replica still
>> >> isn’t returning the right info.
>> >>
>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter` options
>> >> on the `/replication` requestHandler, or is that a non-solrcloud,
>> >> standalone-replication thing?
>> >
>> > This is one of the more confusing aspects of SolrCloud.
>> >
>> > When everything is working perfectly in a SolrCloud install, the feature
>> > in Solr called "replication" is *never* used.  SolrCloud does require
>> > the replication feature, though ... which is what makes this whole thing
>> > very confusing.
>> >
>> > Replication is used to replicate an entire Lucene index (consisting of a
>> > bunch of files on the disk) from a core on a master server to a core on
>> > a slave server.  This is how replication was done before SolrCloud was
>> > created.
>> >
>> > The way that SolrCloud keeps replicas in sync is *entirely* different.
>> > SolrCloud has no masters and no slaves.  When you index or delete a
>> > document in a SolrCloud collection, the request is forwarded to the
>> > leader of the correct shard for that document.  The leader then sends a
>> > copy of that request to all the other replicas, and each replica
>> > (including the leader) independently handles the updates that are in the
>> > request.  Since all replicas index the same content, they stay in sync.
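>> >
>> > To make that concrete: you can send an update to any node, and SolrCloud
>> > forwards it to the leader of the right shard, which distributes it to
>> > every replica. For example (host, collection, and document are made up):
>> >
>> >   curl "http://anynode:8983/solr/my_collection/update?commit=true" \
>> >     -H 'Content-Type: application/json' \
>> >     -d '[{"id":"hd76s004z","title_s":"example"}]'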
>> >
>> > What SolrCloud does with the replication feature is index recovery.  In
>> > some situations recovery can be done from the leader's transaction log,
>> > but when a replica has gotten so far out of sync that the only option
>> > available is to completely replace the index on the bad replica,
>> > SolrCloud will fire up the replication feature and create an exact copy
>> > of the index from the replica that is currently elected as leader.
>> > SolrCloud temporarily designates the leader core as master and the bad
>> > replica as slave, then initiates a one-time replication.  This is all
>> > completely automated and requires no configuration or input from the
>> > administrator.
>> >
>> > The configuration elements you have asked about are for the old
>> > master-slave replication setup and do not apply to SolrCloud at all.
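>> >
>> > For clarity, the sort of config I mean, which should NOT be in a
>> > SolrCloud solrconfig.xml, looks roughly like this master-side sketch:
>> >
>> >   <requestHandler name="/replication" class="solr.ReplicationHandler">
>> >     <lst name="master">
>> >       <str name="replicateAfter">commit</str>
>> >       <str name="replicateAfter">optimize</str>
>> >     </lst>
>> >   </requestHandler>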
>> >
>> > What I would recommend that you do to solve your immediate issue:  Shut
>> > down the Solr instance that is having the problem, rename the "data"
>> > directory in the core that isn't working right to something else, and
>> > start Solr back up.  As long as you still have at least one good replica
>> > in the cloud, SolrCloud will see that the index data is gone and copy
>> > the index from the leader.  You could delete the data directory instead
>> > of renaming it, but that would leave you with no "undo" option.
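>> >
>> > The procedure is roughly the following (paths and start/stop commands
>> > vary by install; the core directory name here is just an example):
>> >
>> >   bin/solr stop        # or however this node is normally stopped
>> >   mv /var/solr/data/my_collection_shard1_replica3/data \
>> >      /var/solr/data/my_collection_shard1_replica3/data.bak
>> >   bin/solr start       # restart with whatever cloud/zk flags you use
>> >   # SolrCloud sees the missing index and copies a full one from the leader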
>> >
>> > Thanks,
>> > Shawn
>> >
>>
>
>
