lucene-solr-user mailing list archives

From Gus Heck <>
Subject Re: Solr Cloud: Zookeeper failure modes
Date Wed, 02 Jan 2019 18:20:58 GMT
I thought jar files for custom code were meant to go into the '.system'
collection, not ZooKeeper. Did I miss a new/old storage option?
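
For reference, a jar is pushed into the '.system' collection through Solr's Blob Store API, which takes a POST of the raw bytes to `/solr/.system/blob/<name>`. The Python sketch below only builds that request with the standard library. It assumes a Solr version that still ships the Blob Store API and a '.system' collection that already exists. The base URL, blob name and jar contents are hypothetical.

```python
import urllib.request


def build_blob_upload(solr_base: str, blob_name: str, jar_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a Blob Store API upload of a jar into '.system'.

    Assumes a Solr version that still ships the Blob Store API and that the
    '.system' collection has already been created.
    """
    url = f"{solr_base.rstrip('/')}/.system/blob/{blob_name}"
    return urllib.request.Request(
        url,
        data=jar_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return Solr's response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Usage, with hypothetical names: `send(build_blob_upload("http://localhost:8983/solr", "myplugin", open("myplugin.jar", "rb").read()))`. Blobs stored this way live in the '.system' collection's index rather than in ZK's data tree, so a ZK snapshot alone does not capture them.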

On Wed, Jan 2, 2019, 12:25 PM Erick Erickson <> wrote:

> 1> No. At one point this could be done, in the sense that the
> collections would be reconstructed (legacyCloud), but that turned out
> to have... side effects. Even in that case, though, Solr couldn't
> reconstruct the configsets. (Insert rant that you really must store
> your configsets in a VCS somewhere, IMO.)
> 2> Should be fine, as long as the state changes don't include things
> like adding replicas or collections, or changing your configsets.
> ZK has nothing to do with commits, for instance. Leader election is
> recorded in ZK, but new leaders will be elected if necessary. Again,
> though, if you've changed the topology (added replicas and/or
> collections, and/or shards if using implicit routing) between the time
> you took the snapshot and the time ZK failed, you'll have an incomplete
> restored state.
> Now, all that said, ZooKeeper data is "just data". Apart from blobs
> stored in ZK, you can manually reconstruct the whole thing with a
> text editor and upload it. This would be tedious and error-prone, to be
> sure, but doable. Periodically storing away a copy of the Collections
> API CLUSTERSTATUS output would help a lot.
> Another approach would be to simply re-create your collections with
> the exact same shard count. That'll create replicas with the same
> ranges etc. Then shut your Solr instances down and copy the data
> directory from the correct old replica to the correct new replica.
> Once you're satisfied that things are running, you can delete the old
> (unused) data. As an aside, in this case I'd create my new
> collection(s) as leader-only (1 replica), then copy as necessary and
> verify that things were as expected. Once that was done, I'd use
> ADDREPLICA to build out the new collection(s). This presupposes you
> can get your configsets back from VCS, as well as any binary data
> you've stored in ZK (e.g. jar files for custom code and the like).
> So overall it's doable even without ZK snapshots, _assuming_ you can
> find copies of your configsets and any custom code you've stored in
> ZK. Not something I'd really _like_ to do, but in an emergency you
> have options.
> But backing up ZK snapshots in a safe place would be, by far, the
> easiest and safest thing to do....
> HTH,
> Erick
> On Wed, Jan 2, 2019 at 12:36 AM Pavel Micka <>
> wrote:
> >
> > Hi,
> > We are currently implementing SolrCloud and, as part of this effort, we
> are investigating which failure modes may occur between Solr and
> ZooKeeper.
> >
> > We have found quite a lot of articles describing the "happy path" failure,
> when ZK stops (loses its majority) and the Solr cluster stops serving write
> requests (while reads continue to work as expected). Once the ZK cluster
> recovers and regains a majority, everything continues working as
> expected.
> >
> > What we have not been able to find is what happens when the ZK cluster
> catastrophically fails and loses its data, either completely (scenario A)
> or by being restored from a backup (scenario B).
> >
> > So now the questions:
> >
> > 1)      Scenario A - Is an existing SolrCloud cluster able to start
> against a clean ZooKeeper and reconstruct all the ZK data from its internal
> state (using some kind of emergency recovery, even if it takes a long time)?
> >
> > 2)      Scenario B - What is the worst-case backup/restore scenario? For
> example when
> >
> > a.       ZK is backed up
> >
> > b.       Cluster performs some transition between states "X -> Y" (such
> as a commit on a shard, electing a new leader, etc.)
> >
> > c.       ZK fails completely
> >
> > d.       ZK is restored from the backup created in step a
> >
> > e.       Solr Cloud is in state "Y", while ZK is in state "X"
> >
> > Thanks in advance,
> >
> > Pavel
> >
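
Erick's suggestions of periodically saving the Collections API CLUSTERSTATUS output, and of later re-creating each collection with the exact same shard count as a leader-only (1 replica) collection, can be sketched in Python as below. This is a minimal sketch using only the standard library. It assumes the response carries `cluster.collections.<name>.shards`, `router.name` and, on versions that report it, `configName`; the base URL and backup directory are hypothetical. Implicitly routed collections are re-created with their explicit shard names, since `numShards` does not apply to them. The sketch only produces CREATE URLs; copying the data directories and the later ADDREPLICA calls stay manual.

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path


def fetch_cluster_status(solr_base: str) -> dict:
    """Fetch the Collections API CLUSTERSTATUS response as a dict."""
    url = f"{solr_base.rstrip('/')}/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def save_snapshot(status: dict, out_dir: str) -> Path:
    """Write the status to a timestamped JSON file (run e.g. from cron)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / time.strftime("clusterstatus-%Y%m%dT%H%M%S.json")
    path.write_text(json.dumps(status, indent=2, sort_keys=True))
    return path


def recreate_plan(status: dict, solr_base: str) -> list[str]:
    """Build leader-only CREATE URLs that mirror each collection's shard layout."""
    base = solr_base.rstrip("/")
    urls = []
    for name, coll in sorted(status["cluster"]["collections"].items()):
        # Skip parent shards left "inactive" by an earlier SPLITSHARD.
        active = {s: v for s, v in coll["shards"].items()
                  if v.get("state", "active") == "active"}
        router = coll.get("router", {}).get("name", "compositeId")
        params = {"action": "CREATE", "name": name, "router.name": router,
                  "replicationFactor": 1}  # leader-only; build out with ADDREPLICA later
        if router == "implicit":
            # Implicit routing needs the explicit shard names instead of numShards.
            params["shards"] = ",".join(sorted(active))
        else:
            params["numShards"] = len(active)
        config = coll.get("configName")
        if config:
            params["collection.configName"] = config
        urls.append(f"{base}/admin/collections?{urllib.parse.urlencode(params)}")
    return urls
```

Usage: schedule `save_snapshot(fetch_cluster_status("http://localhost:8983/solr"), "/var/backups/solr")`, and in an emergency load the newest file with `json.loads(...)` and review the output of `recreate_plan(...)` before running anything. With `numShards`, the new hash ranges match the old ones only if the collection was never split, so compare the `range` values in the snapshot before copying any data directories.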

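For Pavel's scenario B, Erick's point is that commits and leader elections don't matter, but topology changes made after the backup (new collections, shards or replicas) leave the restored ZK state incomplete. Assuming CLUSTERSTATUS output is saved periodically, a small Python sketch can compare the snapshot taken alongside the ZK backup (state "X") with the latest snapshot from before the failure (state "Y") and list what the restore would be missing. Leader flags are ignored on purpose, and configset changes are not visible in CLUSTERSTATUS, so those still have to be checked against VCS. The function names and file names are illustrative.

```python
import json
from pathlib import Path


def topology(status: dict) -> dict[str, set]:
    """Reduce a CLUSTERSTATUS response to its collections, shards and replicas.

    Leader flags and replica states are ignored on purpose: after a restore,
    new leaders are simply elected.
    """
    colls, shards, replicas = set(), set(), set()
    for coll_name, coll in status["cluster"]["collections"].items():
        colls.add(coll_name)
        for shard_name, shard in coll.get("shards", {}).items():
            shards.add((coll_name, shard_name))
            for replica_name in shard.get("replicas", {}):
                replicas.add((coll_name, shard_name, replica_name))
    return {"collections": colls, "shards": shards, "replicas": replicas}


def topology_drift(backup_status: dict, latest_status: dict) -> dict[str, dict[str, list]]:
    """Report what was added or removed between the backup (X) and the latest state (Y)."""
    before, after = topology(backup_status), topology(latest_status)
    return {
        kind: {
            "added": sorted(after[kind] - before[kind]),
            "removed": sorted(before[kind] - after[kind]),
        }
        for kind in before
    }


def load(path: str) -> dict:
    """Load a saved CLUSTERSTATUS snapshot from disk."""
    return json.loads(Path(path).read_text())
```

Usage: `topology_drift(load("clusterstatus-X.json"), load("clusterstatus-Y.json"))`. Any non-empty "added" list names a collection, shard or replica that the restored ZK will not know about and that has to be re-created by hand.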