kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vincent Rischmann <vinc...@rischmann.fr>
Subject Re: Insanely long recovery time with Kafka 0.11.0.2
Date Thu, 11 Jan 2018 18:56:19 GMT
If anyone else has any idea, I'd love to hear it.

Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash and/or take so much
time for recovery.

On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote:
> Hi,
> 
> just to clarify: this is the cause of the crash 
> https://pastebin.com/GuF60kvF in the broker logs, which is why I 
> referenced https://issues.apache.org/jira/browse/KAFKA-4523
> 
> I had this crash some time ago and yesterday was in the process of 
> upgrading my brokers to 0.11.0.2 in part to address this bug, 
> unfortunately while stopping this particular broker it crashed.
> 
> What I don't understand is why the recovery time after upgrading was so 
> high. A couple of month ago when a broker crashed due to this bug and 
> recovered it didn't take nearly as long. In fact, I never had a recovery 
> that long on any broker, even when they suffered a kernel panic or power 
> failure.
> 
> We have quite a bit of data, however as I said with 0.10.1.1 it "only" 
> took around 45 minutes. The broker did do a lot of I/O while recovering 
> (to the point where even just running ls was painfully slow) but afair 
> it was the same every time a broker recovered. Volume of data hasn't 
> changed much since the last crash with 0.10.1.1, in fact I removed a lot 
> of data recently.
> 
> Our brokers are running with 3 SATA disks in raid 0 (using mdadm), while 
> recovering yesterday atop reported around 200MB/s of read throughput. 
> 
> Here are some graphs from our monitoring:
> 
>  - CPU usage https://vrischmann.me/files/fr3/cpu.png
>  - Disk IO https://vrischmann.me/files/fr3/disk_io.png and 
> https://vrischmann.me/files/fr3/disk_usage.png
> 
> On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> > Ismael:
> > We're on the same page.
> > 
> > 0.11.0.2 was released on 17 Nov 2017.
> > 
> > By 'recently' in my previous email I meant the change was newer.
> > 
> > Vincent:
> > Did the machine your broker ran on experience power issue ?
> > 
> > Cheers
> > 
> > On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ismael@juma.me.uk> wrote:
> > 
> > > Hi Ted,
> > >
> > > The change you mention is not part of 0.11.0.2.
> > >
> > > Ismael
> > >
> > > On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > bq. WARN Found a corrupted index file due to requirement failed: Corrupt
> > > > index found, index file
> > > > (/data/kafka/data-processed-15/00000000000054942918.index)
> > > >
> > > > Can you search backward for 00000000000054942918.index in the log to see
> > > if
> > > > we can find the cause for corruption ?
> > > >
> > > > This part of code was recently changed by :
> > > >
> > > > KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> > > > recovery
> > > >
> > > > Cheers
> > > >
> > > > On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vincent@rischmann.fr>
> > > > wrote:
> > > >
> > > > > Here's an excerpt just after the broker started:
> > > > > https://pastebin.com/tZqze4Ya
> > > > >
> > > > > After more than 8 hours of recovery the broker finally started. I
> > > haven't
> > > > > read through all 8 hours of log but the parts I looked at are like
the
> > > > > pastebin.
> > > > >
> > > > > I'm not seeing much in the log cleaner logs either, they look normal.
> > > We
> > > > > have a couple of compacted topics but seems only the consumer offsets
> > > is
> > > > > ever compacted (the other topics don't have much traffic).
> > > > >
> > > > > On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> > > > > > What do the broker logs say its doing during all that time?
> > > > > >
> > > > > > There are some consumer offset / log cleaner bugs which caused
us
> > > > > similarly
> > > > > > log delays. that was easily visible by watching the log cleaner
> > > > activity
> > > > > in
> > > > > > the logs, and in our monitoring of partition sizes watching
them go
> > > > down,
> > > > > > along with IO activity on the host for those files.
> > > > > >
> > > > > > On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <
> > > > vincent@rischmann.fr>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to
fix this
> > > bug
> > > > > > > https://issues.apache.org/jira/browse/KAFKA-4523
> > > > > > > <https://issues.apache.org/jira/browse/KAFKA-4523>
> > > > > > > Unfortunately while stopping one broker, it crashed exactly
because
> > > > of
> > > > > > > this bug. No big deal usually, except after restarting
Kafka in
> > > > > 0.11.0.2
> > > > > > > the recovery is taking a really long time.
> > > > > > > I have around 6TB of data on that broker, and before when
it
> > > crashed
> > > > it
> > > > > > > usually took around 30 to 45 minutes to recover, but now
I'm at
> > > > almost
> > > > > > > 5h since Kafka started and it's still not recovered.
> > > > > > > I'm wondering what could have changed to have such a dramatic
> > > effect
> > > > on
> > > > > > > recovery time ? Is there maybe something I can tweak to
try to
> > > reduce
> > > > > > > the time ?
> > > > > > > Thanks.
> > > > > > >
> > > > >
> > > >
> > >

Mime
View raw message