kafka-users mailing list archives

From Guozhang Wang <wangg...@gmail.com>
Subject Re: doubt regarding consumer rebalance exception.
Date Sun, 13 Apr 2014 20:04:57 GMT
Hi Arjun,

Could you check whether your second (i.e. the newly added) machine has a lot of
long GCs during rebalances?

Guozhang
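(For context on checking for long GCs: one way is to enable GC logging on the consumer's JVM before starting it. The flags below are standard HotSpot options and the log path is just an example; `KAFKA_OPTS` is the environment variable picked up by Kafka's launch scripts.)

```shell
# Sketch: enable GC logging on the consumer's JVM (HotSpot flags; the
# log path is an example). Long stop-the-world pauses will then show up
# as large "application threads were stopped" entries in the log.
export KAFKA_OPTS="-verbose:gc -Xloggc:/tmp/consumer-gc.log \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"
```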


On Fri, Apr 11, 2014 at 7:52 PM, Arjun Kota <arjun@socialtwist.com> wrote:

> Yes, I see a lot of them; they come continuously while the consumer is retrying.
>
> Thanks
> Arjun narasimha kota
> On Apr 12, 2014 6:49 AM, "Guozhang Wang" <wangguoz@gmail.com> wrote:
>
> > Did you see any log entries such as
> >
> > "conflict in ZK path" in your consumer logs?
> >
> > Guozhang
> >
> >
> > On Fri, Apr 11, 2014 at 9:54 AM, Arjun Kota <arjun@socialtwist.com>
> wrote:
> >
> > > I set the retries to 10 and the max time between retries to 5 seconds, and
> > > even then I see this.
> > >
> > > Thanks
> > > Arjun narasimha kota
> > > On Apr 11, 2014 9:02 PM, "Guozhang Wang" <wangguoz@gmail.com> wrote:
> > >
> > > > Arjun,
> > > >
> > > > When consumers exhaust all rebalance retries, they throw the exception and
> > > > stop consuming, and hence some or all partitions would not be consumed by
> > > > anyone. One thing you can do is increase the number of rebalance retries
> > > > (rebalance.max.retries) in your consumer config.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Fri, Apr 11, 2014 at 5:05 AM, Arjun <arjun@socialtwist.com>
> wrote:
> > > >
> > > > > I first had a single consumer node with 3 consumer threads and 12
> > > > > partitions on the Kafka broker. If I check the owners with the consumer
> > > > > offset checker, the result is below.
> > > > >
> > > > > bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group group1 \
> > > > >     --zkconnect zkhost:zkport --topic testtopic
> > > > > Group   Topic      Pid  Offset  logSize  Lag  Owner
> > > > > group1  testtopic  0    253     253      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  1    268     268      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  2    258     258      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  3    265     265      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  4    262     262      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  5    296     296      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  6    248     248      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  7    272     272      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  8    242     242      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > group1  testtopic  9    263     263      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > group1  testtopic  10   294     294      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > group1  testtopic  11   254     254      0    group1_xxxx-1397216047177-6f419d28-2
> > > > >
> > > > > As you can see, owners are present for all partitions.
> > > > >
> > > > > Now I thought that the node was overburdened, and I started one more
> > > > > node. Once the second node had fully started, the output of the consumer
> > > > > offset checker was as below:
> > > > >
> > > > > bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group group1 \
> > > > >     --zkconnect zkhost:zkport --topic testtopic
> > > > > Group   Topic      Pid  Offset  logSize  Lag  Owner
> > > > > group1  testtopic  0    253     253      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  1    268     268      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > group1  testtopic  2    258     258      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  3    265     265      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > group1  testtopic  4    262     262      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > group1  testtopic  5    296     296      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > group1  testtopic  6    248     248      0    none
> > > > > group1  testtopic  7    272     272      0    none
> > > > > group1  testtopic  8    242     242      0    none
> > > > > group1  testtopic  9    263     263      0    none
> > > > > group1  testtopic  10   294     294      0    none
> > > > > group1  testtopic  11   254     254      0    none
> > > > >
> > > > > It has reduced the burden, but the remaining partitions are not taken by
> > > > > any node. Because of this, messages going into those partitions are not
> > > > > being retrieved.
> > > > >
> > > > > The reason I found was that there were some conflicts when the second
> > > > > node tried to take up these partitions, and after 10 retries it just gave
> > > > > up. I tried restarting the second node, hoping the restart would make it
> > > > > take the partitions, but it did not. What is the best way out for me in
> > > > > this scenario?
> > > > >
> > > > > There are cases in our production where we may have to add consumers for
> > > > > a particular topic. If adding consumers is going to result in this, can
> > > > > someone suggest a way out?
> > > > >
> > > > > thanks
> > > > > Arjun NArasimha kota
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Friday 11 April 2014 05:13 PM, Arjun wrote:
> > > > >
> > > > >> Along the same lines, when will the owner column of the result produced
> > > > >> by the consumer offset checker be "none"?
> > > > >>
> > > > >> And what does it signify? Does it mean that a particular partition is up
> > > > >> for grabs but no one has taken it? Why would this happen?
> > > > >>
> > > > >> I know I may be asking some silly questions, but can someone please help
> > > > >> me out here.
> > > > >>
> > > > >> Thanks
> > > > >> Arjun Narasimha Kota
> > > > >>
> > > > >> On Friday 11 April 2014 04:48 PM, Arjun wrote:
> > > > >>
> > > > >>> Sometimes the error is not even printed. The line below gets printed
> > > > >>> (I increased the number of retries to 10):
> > > > >>>
> > > > >>> end rebalancing consumer group1_ip-10-122-57-66-1397214466042-81e47bfe
> > > > >>> try #9
> > > > >>>
> > > > >>> and then the consumer just sits idle.
> > > > >>>
> > > > >>> Thanks
> > > > >>> Arjun Narasimha Kota
> > > > >>>
> > > > >>> On Friday 11 April 2014 04:33 PM, Arjun wrote:
> > > > >>>
> > > > >>>> Once I get this exception:
> > > > >>>>
> > > > >>>> ERROR consumer.ZookeeperConsumerConnector: [xxxxxxxxxx], error during
> > > > >>>> syncedRebalance
> > > > >>>> kafka.common.ConsumerRebalanceFailedException: xxxxxxxxx can't
> > > > >>>> rebalance after 4 retries
> > > > >>>>
> > > > >>>> the consumer does not consume any more messages. Is this the expected
> > > > >>>> behaviour? Is there any property in the high-level consumer through
> > > > >>>> which I can tell it to keep retrying until it gets data? This
> > > > >>>> exception is not actually thrown in the high-level consumer; it is
> > > > >>>> only logged. If the consumer will not get data after this exception,
> > > > >>>> shouldn't it be thrown at a place where the user can catch it and
> > > > >>>> raise an alert?
> > > > >>>>
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Arjun Narasimha Kota
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang
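(Editor's note on the retry settings discussed in this thread: in the old high-level consumer, the relevant properties are rebalance.max.retries, whose default of 4 matches the "can't rebalance after 4 retries" log above, and rebalance.backoff.ms, the wait between attempts. A minimal sketch, with placeholder ZooKeeper host and group id:)

```java
import java.util.Properties;

// Sketch of a high-level consumer config tuned for more patient
// rebalances. The zookeeper.connect value and group id are placeholders;
// the retry values mirror the ones Arjun describes trying (10 retries,
// 5 seconds between them).
public class RebalanceConfigSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zkhost:2181");   // placeholder
        props.put("group.id", "group1");
        props.put("rebalance.max.retries", "10");        // default is 4
        props.put("rebalance.backoff.ms", "5000");       // 5s between attempts
        return props;
    }

    public static void main(String[] args) {
        Properties p = consumerProps();
        System.out.println(p.getProperty("rebalance.max.retries"));
    }
}
```

A longer backoff gives a slow (e.g. GC-pausing) consumer more time to release and re-register partition ownership in ZooKeeper before the next attempt.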
