kafka-users mailing list archives

From Jason Rosenberg <...@squareup.com>
Subject Re: ISR shrink to 0?
Date Thu, 20 Nov 2014 02:58:56 GMT
What if it never comes back with unclean leader election disabled (but
another broker does come back)?

On Wed, Nov 19, 2014 at 9:32 PM, Jun Rao <junrao@gmail.com> wrote:

> In that case, we just wait until the broker in ISR is back and make it the
> leader and take whatever data it has.
>
> Thanks,
>
> Jun
>
> On Tue, Nov 18, 2014 at 10:36 PM, Jason Rosenberg <jbr@squareup.com>
> wrote:
>
> > Ok,
> >
> > Makes sense.  But if the node is not actually healthy (and underwent a hard
> > crash) it would likely not be able to avoid an 'unclean' restart. What
> > happens if unclean leader election is disabled, but there are no 'clean'
> > partitions available?
> >
> > Jason
> >
> > On Wed, Nov 19, 2014 at 12:40 AM, Jun Rao <junrao@gmail.com> wrote:
> >
> > > Yes, we will preserve the last replica in ISR. This way, we know which
> > > replica has all committed messages and can wait for it to come back as the
> > > leader, if unclean leader election is disabled.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Nov 17, 2014 at 11:06 AM, Jason Rosenberg <jbr@squareup.com>
> > > wrote:
> > >
> > > > We have had 2 nodes in a 4 node cluster die this weekend, sadly.
> > > > Fortunately there was no critical data on these machines yet.
> > > >
> > > > The cluster is running 0.8.1.1, and using a replication factor of 2 for 2
> > > > topics, each with 20 partitions.
> > > >
> > > > For sake of discussion, assume that nodes A and B are still up, and C and D
> > > > are now down.
> > > >
> > > > As expected, partitions that had one replica on a good host (A or B) and
> > > > one on a bad node (C or D), had their ISR shrink to just 1 node (A or B).
> > > >
> > > > Roughly 1/6 of the partitions had their 2 replicas on the 2 bad nodes, C
> > > > and D.  For these, I was expecting the ISR to show up as empty, and the
> > > > partition unavailable.
> > > >
> > > > However, that's not what I'm seeing.  When running TopicCommand --describe,
> > > > I see that the ISR still shows 1 replica, on node D (D was the second node
> > > > to go down).
> > > >
> > > > And, producers are still periodically trying to produce to node D (but
> > > > failing and retrying to one of the good nodes).
> > > >
> > > > So, it seems the cluster's metadata is still thinking that node D is up
> > > > and serving the partitions that were only replicated on C and D.  However,
> > > > for partitions that were on A and D, or B and D, D is not shown as being in
> > > > the ISR.
> > > >
> > > > Is this correct?  Should the cluster continue showing the last node to have
> > > > been alive for a partition as still in the ISR?
> > > >
> > > > Jason
> > > >
> > >
> >
>
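
The behavior discussed in this thread (the ISR never shrinking below its last member, and leader election waiting for that member when unclean election is disabled) can be illustrated with a small toy model. This is only a sketch for reasoning about the scenario, not Kafka's actual controller code: the `Partition` class, `broker_down`, `elect_leader`, and the `unclean_enabled` flag are hypothetical names invented for this example, and real brokers track far more state.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy model of one partition's replication state (not Kafka's real code)."""
    replicas: list          # broker ids assigned to this partition
    isr: list               # in-sync replicas as recorded in cluster metadata
    leader: object = None   # current leader broker id, or None if offline


def broker_down(p, broker):
    """Shrink the ISR when a broker dies, but never remove the last member."""
    if broker in p.isr and len(p.isr) > 1:
        p.isr.remove(broker)
    if p.leader == broker:
        p.leader = None


def elect_leader(p, live, unclean_enabled=False):
    """Elect a live ISR member; use a non-ISR replica only if unclean is allowed."""
    candidates = [b for b in p.isr if b in live]
    if not candidates and unclean_enabled:
        # Assumption of this model: an unclean election picks any live replica
        # and resets the ISR to it, accepting possible loss of committed data.
        candidates = [b for b in p.replicas if b in live]
        if candidates:
            p.isr = [candidates[0]]
    p.leader = candidates[0] if candidates else None
    return p.leader


# Scenario from the thread: a partition replicated only on C and D.
p = Partition(replicas=["C", "D"], isr=["C", "D"], leader="C")
broker_down(p, "C")                      # C dies first; ISR shrinks to [D]
elect_leader(p, live={"A", "B", "D"})    # D takes over as leader
broker_down(p, "D")                      # D dies second; ISR stays [D]
print(p.isr, elect_leader(p, live={"A", "B"}))
```

In this model the metadata keeps D in the ISR after D dies, which matches what `TopicCommand --describe` showed. If only C returns, `elect_leader(p, live={"A", "B", "C"})` leaves the partition without a leader unless `unclean_enabled=True` is passed.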
