kafka-users mailing list archives

From Matt Farmer <m...@frmr.me>
Subject Re: Insanely long recovery time with Kafka 0.11.0.2
Date Sat, 06 Jan 2018 15:30:20 GMT
This is “normal” as far as I know. We’ve seen this behavior after unclean shutdowns of
0.10.1.1.

In the event of an unclean shutdown, Kafka seems to have to rebuild some indexes, and for large
data directories this takes some time. We got bitten by this a few times recently when
boxes powered off unexpectedly, which resulted in 2 hours of rebuilding indexes before
the brokers returned to a healthy state.
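
To get a rough sense of how much on-disk data a broker might have to walk through after an unclean shutdown, something like the sketch below can total the segment and index files under a data directory. It only relies on the standard Kafka file suffixes (.log, .index, .timeindex); the function name and the example path are hypothetical, and it says nothing about how much of that data recovery will actually touch.

```python
import os
from collections import defaultdict

# File suffixes Kafka uses for log segments and their offset/time indexes.
SUFFIXES = (".log", ".index", ".timeindex")


def segment_bytes_by_suffix(data_dir):
    """Return total bytes per suffix for matching files under data_dir.

    data_dir is whatever the broker's log.dirs points at; this helper is
    illustrative and not part of Kafka.
    """
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            for suffix in SUFFIXES:
                if name.endswith(suffix):
                    totals[suffix] += os.path.getsize(os.path.join(root, name))
                    break
    return dict(totals)


# Example call (hypothetical path):
#   for suffix, size in sorted(segment_bytes_by_suffix("/var/lib/kafka").items()):
#       print(f"{suffix:>10}: {size / 1e9:.2f} GB")
```

As far as I know, the broker setting num.recovery.threads.per.data.dir (default 1) controls how many threads per data directory are used for log recovery at startup. I haven't measured whether raising it helps in a case like this one.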


> On Jan 6, 2018, at 10:18 AM, Vincent Rischmann <vincent@rischmann.fr> wrote:
> 
> Here's an excerpt just after the broker started: https://pastebin.com/tZqze4Ya
> 
> After more than 8 hours of recovery the broker finally started. I haven't read through
> all 8 hours of log but the parts I looked at are like the pastebin.
> 
> I'm not seeing much in the log cleaner logs either, they look normal. We have a couple
> of compacted topics but it seems only the consumer offsets topic is ever compacted (the other
> topics don't have much traffic).
> 
> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
>> What do the broker logs say it's doing during all that time?
>> 
>> There are some consumer offset / log cleaner bugs which caused us similarly
>> long delays. That was easily visible by watching the log cleaner activity in
>> the logs, and in our monitoring of partition sizes watching them go down,
>> along with IO activity on the host for those files.
>> 
>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vincent@rischmann.fr>
>> wrote:
>> 
>>> Hello,
>>> 
>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this bug
>>> https://issues.apache.org/jira/browse/KAFKA-4523
>>> Unfortunately while stopping one broker, it crashed exactly because of
>>> this bug. No big deal usually, except after restarting Kafka in 0.11.0.2
>>> the recovery is taking a really long time.
>>> I have around 6TB of data on that broker, and before when it crashed it
>>> usually took around 30 to 45 minutes to recover, but now I'm at almost
>>> 5h since Kafka started and it's still not recovered.
>>> I'm wondering what could have changed to have such a dramatic effect on
>>> recovery time? Is there maybe something I can tweak to try to reduce
>>> the time?
>>> Thanks.
>>> 

