kafka-users mailing list archives

From Guozhang Wang <wangg...@gmail.com>
Subject Re: restarting a broker during partition reassignment
Date Wed, 25 Jun 2014 15:09:16 GMT
Kafka should not reset the offset to zero by itself. Do you see any
exceptions in the ZooKeeper logs? There are some known ZooKeeper bugs that
can cause the broker registration node to be deleted, but I am not sure
whether any of them can cause an offset reset.
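One way to follow up on this suggestion is to look directly at the broker
registration znode (`/brokers/ids/<brokerId>`) and check that its payload is
present and well-formed. The payload below is a hypothetical example of the
JSON a 0.8.x broker writes there (fetched e.g. with `zookeeper-shell.sh`); the
parsing sketch just shows what a healthy registration looks like:

```python
import json

# Hypothetical example of a 0.8.x broker registration payload, as stored
# under /brokers/ids/11 in ZooKeeper. Values here are made up for
# illustration; fetch the real payload from your own cluster.
payload = ('{"jmx_port":-1,"timestamp":"1403708956000",'
           '"host":"broker11","version":1,"port":9092}')

reg = json.loads(payload)

# If the znode is missing entirely, the broker has dropped out of the
# cluster from ZooKeeper's point of view -- the symptom the known ZK bugs
# mentioned above can produce.
assert reg["host"] and reg["port"] > 0
print("broker registered at %s:%d" % (reg["host"], reg["port"]))
```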

Guozhang


On Tue, Jun 24, 2014 at 8:44 AM, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

>
> My hypothesis for how partition [luke3,3], with leader 11, had its offset
> reset to zero when the leader broker was rebooted during partition
> reassignment:
>
> 1. The replicas for [luke3,3] were in the process of being reassigned from
>    brokers 10,11,12 -> 11,12,13.
> 2. I rebooted broker 11, which was the leader for [luke3,3].
> 3. The broker 12 and 13 logs indicate replica fetch failures against
>    leader 11 due to connection timeouts.
> 4. Broker 10 attempted to become the leader for [luke3,3] but hit an issue
>    (I see a ZooKeeper exception, but I'm unsure what is happening).
> 5. Broker 11 eventually came back online and attempted to fetch from the
>    new leader, broker 10.
> 6. Broker 11 completed its fetch from leader 10 at offset 0.
> 7. Broker 10 is the leader, but it is serving a fresh data log, so the
>    offset has been reset.
> 8. The remaining brokers truncated their logs and began following broker 10.
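The final truncate-and-follow step of this hypothesis can be sketched with a
toy model. This is not Kafka's actual replica-fetcher code, just the shape of
the failure: followers align themselves to the leader's log end offset, so if
the new leader comes up serving an empty log, every follower truncates to 0:

```python
# Toy model of follower truncation, illustrating the hypothesis above.
# Broker names and record counts are taken from the thread; the logic is
# a simplification, not Kafka's replication protocol.

def truncate_to_leader(follower_log_end, leader_log_end):
    """A follower whose log extends past the leader's truncates back."""
    return min(follower_log_end, leader_log_end)

# Before the bounce: the followers each hold ~6 million records.
followers = {"broker12": 6_000_000, "broker13": 6_000_000}

# Broker 10 takes over leadership but is serving a fresh, empty log.
new_leader_log_end = 0

after = {b: truncate_to_leader(end, new_leader_log_end)
         for b, end in followers.items()}
print(after)  # every follower truncates to offset 0
```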
>
> Gist of logs for brokers 13,11,12 that I think backs up this summary:
> https://gist.github.com/anonymous/cb79dc251d87e334cfff
>
>
>
> Thanks,
> Luke Forehand |  Networked Insights  |  Software Engineer
>
>
>
> On 6/23/14, 5:57 PM, "Guozhang Wang" <wangguoz@gmail.com> wrote:
>
> >Hi Luke,
> >
> >What are the exceptions/warnings you saw in the broker and controller
> >logs?
> >
> >Guozhang
> >
> >
> >On Mon, Jun 23, 2014 at 2:03 PM, Luke Forehand <
> >luke.forehand@networkedinsights.com> wrote:
> >
> >> Hello,
> >>
> >> I am testing Kafka 0.8.1.1 in preparation for an upgrade from
> >> kafka-0.8.1-beta.  I have a 4-node cluster with one broker per node,
> >> and a topic with 8 partitions and 3 replicas.  Each partition holds
> >> about 6 million records.
> >>
> >> I generated a partition reassignment JSON that shifts every partition
> >> by one broker.  While the reassignment was in progress, I bounced one
> >> of the servers.  After the server came back up and the broker started,
> >> I waited for the server logs to stop complaining, then ran the
> >> reassignment verify script, and all partitions were verified as having
> >> completed reassignment.
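A "shift every partition by one broker" reassignment of this shape can be
generated with a short script. The broker IDs, topic name, and partition
count below come from the thread, but the current assignment is hypothetical
and the JSON is only a sketch of the format `kafka-reassign-partitions.sh`
accepts, not the poster's actual file:

```python
import json

brokers = [10, 11, 12, 13]                       # the 4 brokers in the thread
next_broker = {b: brokers[(i + 1) % len(brokers)]
               for i, b in enumerate(brokers)}

# Hypothetical current assignment: 8 partitions, 3 replicas each, chosen so
# that partition 3 starts on brokers 10,11,12 as described in the thread.
current = {p: [brokers[(p + 1 + k) % 4] for k in range(3)] for p in range(8)}

# Shift every replica to the next broker, e.g. 10,11,12 -> 11,12,13.
reassignment = {
    "version": 1,
    "partitions": [
        {"topic": "luke3",                       # topic name from the thread
         "partition": p,
         "replicas": [next_broker[b] for b in current[p]]}
        for p in sorted(current)
    ],
}
print(json.dumps(reassignment))
```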
> >>
> >> However, one partition's offset was reset to 0, and 4 of the 8
> >> partitions had only 2 in-sync replicas instead of 3 (in-sync came back
> >> to 3, but only after I again bounced the server I had previously
> >> bounced during reassignment).
> >>
> >> Is this considered a bug?  I ask because we use the SimpleConsumer API,
> >> so we keep track of our own offset "pointers".  If it is not a bug, I
> >> could reset the pointer to "earliest" and continue reading, but I am
> >> wondering whether there is potential for data loss in my scenario.  I
> >> have plenty of logs and can reproduce the issue, but before I spam the
> >> list I wanted to ask whether there is already a JIRA ticket for this or
> >> whether anybody else is aware of it.
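For SimpleConsumer users, one common defensive pattern is: when a stored
offset pointer falls outside the broker's valid range (an OffsetOutOfRange
response), re-resolve it against the broker instead of trusting the pointer.
The sketch below is a language-agnostic illustration of that policy; the
function and its arguments are hypothetical stand-ins, not the real
SimpleConsumer API (where `earliest`/`latest` would come from an
OffsetRequest):

```python
EARLIEST, LATEST = "earliest", "latest"

def resolve_offset(stored, earliest, latest, policy=EARLIEST):
    """Clamp a stored offset pointer into the broker's valid range.

    Hypothetical helper: `earliest` and `latest` stand in for the values an
    OffsetRequest would return for the partition.
    """
    if earliest <= stored <= latest:
        return stored                 # pointer is still valid
    # Out of range -- e.g. the partition's log was reset to offset 0, or
    # old segments were deleted. Restart from a known-good position.
    return earliest if policy == EARLIEST else latest

# The scenario from the thread: our pointer says ~6M, but the partition now
# reports earliest=0 and a latest far below our pointer.
print(resolve_offset(6_000_000, 0, 1_200, policy=EARLIEST))  # -> 0
```

Note the trade-off: resetting to "earliest" rereads everything still on the
broker, while "latest" skips ahead; neither recovers messages that were lost
when the log was reset, which is why the data-loss question above matters.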
> >>
> >> Thanks,
> >> Luke Forehand |  Networked Insights  |  Software Engineer
> >>
> >>
> >
> >
> >--
> >-- Guozhang
>
>


-- 
-- Guozhang
