kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vincent Rischmann <vinc...@rischmann.fr>
Subject Re: Insanely long recovery time with Kafka 0.11.0.2
Date Sat, 06 Jan 2018 18:25:00 GMT
Hi,

just to clarify: this is the cause of the crash https://pastebin.com/GuF60kvF in the broker
logs, which is why I referenced https://issues.apache.org/jira/browse/KAFKA-4523

I had this crash some time ago and yesterday was in the process of upgrading my brokers to
0.11.0.2 in part to address this bug, unfortunately while stopping this particular broker
it crashed.

What I don't understand is why the recovery time after upgrading was so high. A couple of
month ago when a broker crashed due to this bug and recovered it didn't take nearly as long.
In fact, I never had a recovery that long on any broker, even when they suffered a kernel
panic or power failure.

We have quite a bit of data, however as I said with 0.10.1.1 it "only" took around 45 minutes.
The broker did do a lot of I/O while recovering (to the point where even just running ls was
painfully slow) but afair it was the same every time a broker recovered. Volume of data hasn't
changed much since the last crash with 0.10.1.1, in fact I removed a lot of data recently.

Our brokers are running with 3 SATA disks in raid 0 (using mdadm), while recovering yesterday
atop reported around 200MB/s of read throughput. 

Here are some graphs from our monitoring:

 - CPU usage https://vrischmann.me/files/fr3/cpu.png
 - Disk IO https://vrischmann.me/files/fr3/disk_io.png and https://vrischmann.me/files/fr3/disk_usage.png

On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> Ismael:
> We're on the same page.
> 
> 0.11.0.2 was released on 17 Nov 2017.
> 
> By 'recently' in my previous email I meant the change was newer.
> 
> Vincent:
> Did the machine your broker ran on experience power issue ?
> 
> Cheers
> 
> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ismael@juma.me.uk> wrote:
> 
> > Hi Ted,
> >
> > The change you mention is not part of 0.11.0.2.
> >
> > Ismael
> >
> > On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > bq. WARN Found a corrupted index file due to requirement failed: Corrupt
> > > index found, index file
> > > (/data/kafka/data-processed-15/00000000000054942918.index)
> > >
> > > Can you search backward for 00000000000054942918.index in the log to see
> > if
> > > we can find the cause for corruption ?
> > >
> > > This part of code was recently changed by :
> > >
> > > KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> > > recovery
> > >
> > > Cheers
> > >
> > > On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vincent@rischmann.fr>
> > > wrote:
> > >
> > > > Here's an excerpt just after the broker started:
> > > > https://pastebin.com/tZqze4Ya
> > > >
> > > > After more than 8 hours of recovery the broker finally started. I
> > haven't
> > > > read through all 8 hours of log but the parts I looked at are like the
> > > > pastebin.
> > > >
> > > > I'm not seeing much in the log cleaner logs either, they look normal.
> > We
> > > > have a couple of compacted topics but seems only the consumer offsets
> > is
> > > > ever compacted (the other topics don't have much traffic).
> > > >
> > > > On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> > > > > What do the broker logs say its doing during all that time?
> > > > >
> > > > > There are some consumer offset / log cleaner bugs which caused us
> > > > similarly
> > > > > log delays. that was easily visible by watching the log cleaner
> > > activity
> > > > in
> > > > > the logs, and in our monitoring of partition sizes watching them
go
> > > down,
> > > > > along with IO activity on the host for those files.
> > > > >
> > > > > On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <
> > > vincent@rischmann.fr>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix
this
> > bug
> > > > > > https://issues.apache.org/jira/browse/KAFKA-4523
> > > > > > <https://issues.apache.org/jira/browse/KAFKA-4523>
> > > > > > Unfortunately while stopping one broker, it crashed exactly
because
> > > of
> > > > > > this bug. No big deal usually, except after restarting Kafka
in
> > > > 0.11.0.2
> > > > > > the recovery is taking a really long time.
> > > > > > I have around 6TB of data on that broker, and before when it
> > crashed
> > > it
> > > > > > usually took around 30 to 45 minutes to recover, but now I'm
at
> > > almost
> > > > > > 5h since Kafka started and it's still not recovered.
> > > > > > I'm wondering what could have changed to have such a dramatic
> > effect
> > > on
> > > > > > recovery time ? Is there maybe something I can tweak to try
to
> > reduce
> > > > > > the time ?
> > > > > > Thanks.
> > > > > >
> > > >
> > >
> >

Mime
View raw message