kafka-users mailing list archives

From Arjun Kota <ar...@socialtwist.com>
Subject Re: doubt regarding consumer rebalance exception.
Date Mon, 14 Apr 2014 04:11:39 GMT
Yup, will check and let you know.

Sorry for the delayed response, I live in another part of the world.
On Apr 14, 2014 1:35 AM, "Guozhang Wang" <wangguoz@gmail.com> wrote:

> Hi Arjun,
>
> Could you check whether your second (i.e. the newly added) machine has a
> lot of long GCs during rebalances?
>
> Guozhang
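
One way to check for long GC pauses is sketched below as a minimal Java snippet that could run inside the consumer process; the class name and the 1-second threshold are placeholder assumptions, not from this thread, and simply enabling GC logging with -verbose:gc on the consumer JVM is the simpler alternative.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Minimal sketch: poll the JVM's cumulative GC time and flag large jumps,
    // which would point to the kind of long pauses that can break a rebalance.
    public class GcPauseWatcher implements Runnable {
        @Override
        public void run() {
            long lastTotalMs = 0;
            while (true) {
                long totalMs = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    totalMs += gc.getCollectionTime();   // cumulative millis spent in GC
                }
                long deltaMs = totalMs - lastTotalMs;
                if (deltaMs > 1000) {                    // threshold is an arbitrary assumption
                    System.err.println("Spent " + deltaMs + " ms in GC over the last 5 s");
                }
                lastTotalMs = totalMs;
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

Started as a daemon thread next to the consumer, this would make long pauses visible in the consumer's own logs.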
>
>
> On Fri, Apr 11, 2014 at 7:52 PM, Arjun Kota <arjun@socialtwist.com> wrote:
>
> > Yes, I see a lot of them; they come continuously while the consumer is
> > retrying.
> >
> > Thanks
> > Arjun Narasimha Kota
> > On Apr 12, 2014 6:49 AM, "Guozhang Wang" <wangguoz@gmail.com> wrote:
> >
> > > Did you see any log entries such as "conflict in ZK path" in your
> > > consumer logs?
> > >
> > > Guozhang
> > >
> > >
> > > On Fri, Apr 11, 2014 at 9:54 AM, Arjun Kota <arjun@socialtwist.com> wrote:
> > >
> > > > I set the retries to 10 and the max time between retries to 5 seconds;
> > > > even then I see this.
> > > >
> > > > Thanks
> > > > Arjun Narasimha Kota
> > > > On Apr 11, 2014 9:02 PM, "Guozhang Wang" <wangguoz@gmail.com> wrote:
> > > >
> > > > > Arjun,
> > > > >
> > > > > When consumers exhaust all rebalance retries they will throw the
> > > > > exception and stop consuming, and hence some or all partitions would
> > > > > not be consumed by anyone. One thing you can do is to increase the
> > > > > num.retries in your consumer config.
> > > > >
> > > > > Guozhang
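
For reference, a minimal sketch of what raising the retry settings can look like on the 0.8 high-level consumer; rebalance.max.retries and rebalance.backoff.ms appear to be the properties meant by "num.retries" here, and the ZooKeeper address, group id and chosen values are placeholders.

    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class RebalanceTuning {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zkhost:zkport");     // placeholder
            props.put("group.id", "group1");
            props.put("rebalance.max.retries", "10");            // default is 4, matching the
                                                                 // "after 4 retries" message in this thread
            props.put("rebalance.backoff.ms", "5000");           // wait 5 s between rebalance attempts
            props.put("zookeeper.session.timeout.ms", "10000");  // headroom for GC pauses
            ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // ... create message streams and consume as usual ...
            consumer.shutdown();
        }
    }

The guidance usually given is to keep rebalance.backoff.ms * rebalance.max.retries above zookeeper.session.timeout.ms, so that a consumer whose old ephemeral node has not yet expired in ZooKeeper does not run out of retries first.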
> > > > >
> > > > >
> > > > > On Fri, Apr 11, 2014 at 5:05 AM, Arjun <arjun@socialtwist.com> wrote:
> > > > >
> > > > > > I first have a single consumer node with 3 consumer threads and 12
> > > > > > partitions on the Kafka broker. If I check the owner in the consumer
> > > > > > offset checker, the result is below.
> > > > > >
> > > > > > bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group group1 --zkconnect zkhost:zkport --topic testtopic
> > > > > > Group   Topic      Pid  Offset  logSize  Lag  Owner
> > > > > > group1  testtopic  0    253     253      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  1    268     268      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  2    258     258      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  3    265     265      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  4    262     262      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  5    296     296      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  6    248     248      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  7    272     272      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  8    242     242      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > > group1  testtopic  9    263     263      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > > group1  testtopic  10   294     294      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > > group1  testtopic  11   254     254      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > >
> > > > > > As you can see, owners are present for all partitions.
> > > > > >
> > > > > > Now I thought that the node was overburdened, so I started one more
> > > > > > node. When the second node had started completely, the output of the
> > > > > > consumer offset checker was as below.
> > > > > >
> > > > > > bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group group1 --zkconnect zkhost:zkport --topic testtopic
> > > > > > Group   Topic      Pid  Offset  logSize  Lag  Owner
> > > > > > group1  testtopic  0    253     253      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  1    268     268      0    group1_xxxx-1397216047177-6f419d28-0
> > > > > > group1  testtopic  2    258     258      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  3    265     265      0    group1_xxxx-1397216047177-6f419d28-1
> > > > > > group1  testtopic  4    262     262      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > > group1  testtopic  5    296     296      0    group1_xxxx-1397216047177-6f419d28-2
> > > > > > group1  testtopic  6    248     248      0    none
> > > > > > group1  testtopic  7    272     272      0    none
> > > > > > group1  testtopic  8    242     242      0    none
> > > > > > group1  testtopic  9    263     263      0    none
> > > > > > group1  testtopic  10   294     294      0    none
> > > > > > group1  testtopic  11   254     254      0    none
> > > > > >
> > > > > > It has reduced the burden, but the other partitions are not taken by
> > > > > > any node. Because of this, messages going into those partitions are
> > > > > > not getting retrieved.
> > > > > >
> > > > > > The reason I found was that there were some conflicts when the
> > > > > > second node took up these partitions, and after 10 retries it just
> > > > > > gave up. I tried restarting the second node, hoping the restart would
> > > > > > make it take the partitions, but it did not. What is the best way out
> > > > > > for me in this scenario?
> > > > > >
> > > > > > There are cases in our production where we may have to add consumers
> > > > > > for a particular topic. If adding consumers is going to result in
> > > > > > this, can someone suggest a way out?
> > > > > >
> > > > > > Thanks
> > > > > > Arjun Narasimha Kota
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Friday 11 April 2014 05:13 PM, Arjun wrote:
> > > > > >
> > > > > >> Along the same lines, when will the owner column of the result
> > > > > >> produced by the consumer offset checker be none?
> > > > > >>
> > > > > >> And what does it signify? Does it mean that particular partition is
> > > > > >> up for grabs but no one has taken it? Why would this happen?
> > > > > >>
> > > > > >> I know I may be asking some silly questions, but can someone please
> > > > > >> help me out here.
> > > > > >>
> > > > > >> Thanks
> > > > > >> Arjun Narasimha Kota
> > > > > >>
> > > > > >> On Friday 11 April 2014 04:48 PM, Arjun wrote:
> > > > > >>
> > > > > >>> Sometimes the error is not even printed. The line below gets
> > > > > >>> printed (I increased the number of retries to 10):
> > > > > >>>
> > > > > >>> end rebalancing consumer group1_ip-10-122-57-66-1397214466042-81e47bfe try #9
> > > > > >>>
> > > > > >>> and then the consumer just sits idle.
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>> Arjun Narasimha Kota
> > > > > >>>
> > > > > >>> On Friday 11 April 2014 04:33 PM, Arjun wrote:
> > > > > >>>
> > > > > >>>> Once I get this exception
> > > > > >>>>
> > > > > >>>> ERROR consumer.ZookeeperConsumerConnector: [xxxxxxxxxx ], error during syncedRebalance
> > > > > >>>> kafka.common.ConsumerRebalanceFailedException: xxxxxxxxx can't rebalance after 4 retries
> > > > > >>>>
> > > > > >>>> The consumer is not consuming any more messages. Is this the
> > > > > >>>> expected behaviour? Is there any property in the high-level
> > > > > >>>> consumer through which I can tell the consumer to keep retrying
> > > > > >>>> until it gets the data? This exception is not actually thrown to
> > > > > >>>> the caller by the high-level consumer; it is just logged. If the
> > > > > >>>> consumer will not get data after this exception, shouldn't it be
> > > > > >>>> thrown at a place where the user can catch it and raise an alert?
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Thanks
> > > > > >>>> Arjun Narasimha Kota
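
When the failure happens during start-up, the exception can be caught around the call that creates the streams. Below is a minimal sketch assuming the 0.8 Java high-level consumer API, with the topic name and thread count taken from the setup in this thread and everything else a placeholder; rebalances triggered later by the ZooKeeper watcher thread are only logged, as described above.

    import java.util.Collections;
    import java.util.Properties;
    import kafka.common.ConsumerRebalanceFailedException;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ConsumerStartup {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zkhost:zkport");   // placeholder
            props.put("group.id", "group1");
            props.put("rebalance.max.retries", "10");
            ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            try {
                // 3 streams for "testtopic", matching the setup described in this thread
                consumer.createMessageStreams(Collections.singletonMap("testtopic", 3));
            } catch (ConsumerRebalanceFailedException e) {
                // The initial rebalance exhausted its retries, so nothing is owned:
                // raise an alert here, then shut down or back off and recreate the connector.
                consumer.shutdown();
                throw e;
            }
        }
    }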
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>
