lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr Cloud: Zookeeper failure modes
Date Wed, 02 Jan 2019 21:32:17 GMT
Right, don't quite know what I was thinking about. Even so, if
ZooKeeper is gone you'd still have to rebuild the .system collection
too. Or at least figure out how to access it again.

On Wed, Jan 2, 2019 at 10:21 AM Gus Heck <gus.heck@gmail.com> wrote:
>
> I thought jar files for custom code were meant to go into the '.system'
> collection, not zookeeper. Did I miss a new/old storage option?
>
> On Wed, Jan 2, 2019, 12:25 PM Erick Erickson <erickerickson@gmail.com> wrote:
>
> > 1> No. At one point this could be done, in the sense that the
> > collections would be reconstructed (legacyCloud), but that turned out
> > to have... side effects. Even in that case, though, Solr couldn't
> > reconstruct the configsets. (Insert rant that you really must store
> > your configsets in a VCS somewhere, IMO.)
> >
> > 2> Should be fine, as long as the state changes don't include things
> > like adding replicas or collections, or changing your configsets.
> > ZK has nothing to do with commits, for instance. Leader election is
> > recorded in ZK, but other leaders will be elected if necessary. Again,
> > though, if you've changed the topology (added replicas and/or
> > collections and/or shards if using implicit routing) between the time
> > you took the snapshot and the time ZK failed, you'll have an
> > incomplete restored state.
> >
> > Now, all that said, ZooKeeper data is "just data". Apart from blobs
> > stored in ZK, you can manually reconstruct the whole thing with a
> > text editor and upload it. This would be tedious and error-prone, to
> > be sure, but do-able. Periodically storing away a copy of the
> > Collections API CLUSTERSTATUS output would help a lot.
> >
> > Another approach would be to simply re-create your collections with
> > the exact same shard count. That'll create replicas with the same
> > ranges etc. Then shut your Solr instances down and copy the data
> > directory from the correct old replica to the correct new replica.
> > Once you're satisfied that things are running, you can delete the old
> > (unused) data. As an aside, in this case I'd create my new
> > collection(s) as leader-only (1 replica), then copy as necessary and
> > verify that things were as expected. Once that was done, I'd use
> > ADDREPLICA to build out the new collection(s). This pre-supposes you
> > can get your configsets back from VCS as well as any binary data
> > you've stored in ZK (e.g. jar files for custom code and the like).
> >
> > So overall it's do-able even without ZK snapshots _assuming_ you can
> > find copies of your configsets and any custom code you've stored in
> > ZK. Not something I'd really _like_ to do, but in an emergency you
> > have options.
> >
> > But backing up ZK snapshots in a safe place would be, by far, the
> > easiest and safest thing to do....
> >
> > HTH,
> > Erick
> >
> > On Wed, Jan 2, 2019 at 12:36 AM Pavel Micka <Pavel.Micka@zoomint.com>
> > wrote:
> > >
> > > Hi,
> > > We are currently implementing Solr Cloud and, as part of this effort,
> > > we are investigating which failure modes may occur between Solr and
> > > ZooKeeper.
> > >
> > > We have found quite a lot of articles describing the "happy path"
> > > failure, when ZK stops (loses majority) and the Solr cluster ceases
> > > to serve write requests (while reads continue to work as expected).
> > > Once the ZK cluster is reconciled and majority is achieved again,
> > > everything continues working as expected.
> > >
> > > What we have not been able to find is what happens when the ZK
> > > cluster catastrophically fails and either loses its data completely
> > > (scenario A) or is restarted from a backup (scenario B).
> > >
> > > So now the questions:
> > >
> > > 1) Scenario A - Is an existing Solr Cloud cluster able to start
> > > against a clean ZooKeeper and reconstruct all the ZK data from its
> > > internal state (using some kind of emergency recovery; it may take
> > > a long time)?
> > >
> > > 2) Scenario B - What is the worst-case backup/restore scenario? For
> > > example, when:
> > >
> > > a. ZK is backed up
> > >
> > > b. Cluster performs some transition between states "X -> Y" (such
> > > as commit shard, elect new leader etc.)
> > >
> > > c. ZK fails completely
> > >
> > > d. ZK is restored from the backup created in step a
> > >
> > > e. Solr Cloud is in state "Y", while ZK is in state "X"
> > >
> > > Thanks in advance,
> > >
> > > Pavel
> > >
> >
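
The thread above suggests periodically storing away a copy of the
Collections API CLUSTERSTATUS output. A minimal sketch of that idea
follows, assuming a Solr node reachable over plain HTTP without
authentication. The function names, the snapshot file naming, and the
injectable `fetch` parameter are illustrative choices made here, not
anything from the thread or from Solr itself. Note that this records
only cluster topology; it does not capture configsets or any binary
data stored in ZK.

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path


def clusterstatus_url(solr_base_url):
    """Build the Collections API CLUSTERSTATUS URL for one Solr node.

    solr_base_url is assumed to look like http://host:8983/solr
    """
    query = urllib.parse.urlencode({"action": "CLUSTERSTATUS", "wt": "json"})
    return solr_base_url.rstrip("/") + "/admin/collections?" + query


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def save_clusterstatus(solr_base_url, out_dir, fetch=fetch_json, now=None):
    """Write one timestamped CLUSTERSTATUS snapshot and return its path.

    fetch and now are injectable (hypothetical conveniences) so the
    function can be exercised without a running Solr node.
    """
    status = fetch(clusterstatus_url(solr_base_url))
    # UTC timestamp in the file name keeps snapshots sortable.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"clusterstatus-{stamp}.json"
    path.write_text(json.dumps(status, indent=2, sort_keys=True))
    return path
```

Calling save_clusterstatus("http://localhost:8983/solr", "/some/backup/dir")
from a scheduled job (both values are placeholders) would leave a series
of JSON files that could be consulted when rebuilding collections by
hand. Keeping these next to the configsets in version control is one
possible arrangement.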
