kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: Best approach to frequently restarting consumer process
Date Sun, 11 Dec 2016 03:37:22 GMT
Consumer groups aren't going to handle 'let it crash' particularly well
(and really any session-based services, but particularly consumer groups
since a single failure affects the entire group). That said, 'let it crash'
doesn't necessarily have to mean 'don't try to clean up at all'. The
consumer group will recover *much* more quickly if you make sure any crash
path includes a:

finally {
    consumer.close();
}

block to do some minimal cleanup. This will cause the consumer to make a
best effort to explicitly leave the group, allowing rebalancing to complete
after the rest of the members rejoin. If you don't do this, your rebalances
get much more expensive since the group coordinator needs to wait for the
session timeout. This will probably lead to noticeably longer pauses. The
one drawback to doing this today is that the close() can potentially block,
so it may not fail as fast as you want it to -- it would be good to get a
timeout-based close() implemented as well. That said, the LeaveGroup
request *is* best effort, so if the consumer was otherwise in a healthy
state, this should be very fast.
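
To make the crash path concrete, here is a rough sketch of the pattern, not a definitive implementation. GroupConsumer and FlakyConsumer are hypothetical stand-ins invented for this example so that it runs without a broker; in a real application the two calls would be the consumer's own poll and close(). The failure still propagates, so the process still crashes, but close() gets its chance to run first.

```java
import java.util.ArrayList;
import java.util.List;

public class CrashPathCleanup {

    /** Hypothetical stand-in for the two consumer calls that matter here. */
    interface GroupConsumer {
        void poll();   // fetch and process one batch; may throw
        void close();  // best-effort attempt to leave the group
    }

    /** Polls until something fails, then cleans up and rethrows the failure. */
    static void runUntilCrash(GroupConsumer consumer) {
        try {
            while (true) {
                consumer.poll();
            }
        } finally {
            // Runs on every exit path, including an uncaught exception.
            consumer.close();
        }
    }

    /** Hypothetical consumer that fails after a fixed number of good polls. */
    static class FlakyConsumer implements GroupConsumer {
        final List<String> events = new ArrayList<>();
        private final int okPolls;
        private int polls = 0;

        FlakyConsumer(int okPolls) {
            this.okPolls = okPolls;
        }

        @Override
        public void poll() {
            if (polls == okPolls) {
                events.add("crash");
                throw new IllegalStateException("simulated failure");
            }
            polls++;
            events.add("poll");
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    public static void main(String[] args) {
        FlakyConsumer consumer = new FlakyConsumer(2);
        try {
            runUntilCrash(consumer);
        } catch (IllegalStateException e) {
            // A 'let it crash' process would simply die here.
        }
        System.out.println(consumer.events); // prints [poll, poll, crash, close]
    }
}
```

Note that a finally block only helps when the JVM gets to unwind the stack. A process killed with SIGKILL never runs it, and in that case the group still has to wait for the session timeout.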

All this said, 'let it crash' isn't the same thing as 'constant crashes are
ok'. It's a fault recovery methodology, but crashing every 5 minutes isn't
what the telecom industry had in mind... If things are crashing that
frequently, there is likely a very common bug/memory leak/etc which can be
fixed to significantly reduce the frequency of crashes. Generally 'let it
crash' systems also provide a good way to collect debugging
information for exactly this purpose.


On Wed, Dec 7, 2016 at 1:38 AM, Harald Kirsch <harald.kirsch@raytion.com>
wrote:

> With 'restart' I mean a 'let it crash' setup (as promoted by Erlang and
> Akka, e.g. http://doc.akka.io/docs/akka/snapshot/intro/what-is-akka.html).
> The consumer gets in trouble due to an OOM or a runaway computation or
> whatever that we want to preempt somehow. It crashes or gets killed
> externally.
> So whether close() is called or not in the dying process, I don't know.
> But clearly the subscribe is called after a restart.
> I understand that we are out of luck with this. We would have to separate
> the crashing part out into a different operating system process, but must
> keep the consumer running all the time. :-(
> Thanks for the insight
> Harald
> On 06.12.2016 19:26, Gwen Shapira wrote:
>> Can you clarify what you mean by "restart"? If you call
>> consumer.close() and consumer.subscribe() you will definitely trigger
>> a rebalance.
>> It doesn't matter if it's "same consumer knocking", we already
>> rebalance when you call consumer.close().
>> Since we want both consumer.close() and consumer.subscribe() to cause
>> rebalance immediately (and not wait for heartbeat), I don't think
>> we'll be changing their behavior.
>> Depending on why consumers need to restart, I'm wondering if you can
>> restart other threads in your application but keep the consumer up and
>> running to avoid the rebalances.
>> On Tue, Dec 6, 2016 at 7:18 AM, Harald Kirsch <harald.kirsch@raytion.com>
>> wrote:
>>> We have consumer processes which need to restart frequently, say, every 5
>>> minutes. We have 10 of them so we are facing two restarts every minute on
>>> average.
>>> 1) It seems that nearly every time a consumer restarts the group is
>>> rebalanced. Even if the restart takes less than the heartbeat interval.
>>> 2) My guess is that the group manager just cannot know that the same
>>> consumer is knocking at the door again.
>>> Are my suspicions (1) and (2) correct? Is there a chance to fix this such
>>> that a restart within the heartbeat interval does not lead to a
>>> re-balance?
>>> Would a well defined client.id help?
>>> Regards
>>> Harald

