kafka-users mailing list archives

From Ismael Juma <ism...@juma.me.uk>
Subject Re: Insanely long recovery time with Kafka 0.11.0.2
Date Mon, 15 Jan 2018 09:04:43 GMT
Hi James,

There was a bug in 0.11.0.0 that could cause all segments to be scanned
during a restart. I believe that was fixed in subsequent 0.11.0.x releases.

Ismael

On Fri, Jan 12, 2018 at 6:49 AM, James Cheng <wushujames@gmail.com> wrote:

> We saw this as well, when updating from 0.10.1.1 to 0.11.0.1.
>
> Have you restarted your brokers since then? Did it take 8h to start up
> again, or did it take its normal 45 minutes?
>
> I don't think it's related to the crash/recovery. Rather, I think it's due
> to the upgrade from 0.10.1.1 to 0.11.0.1.
>
> I think 0.11.0.0 introduced some new files in the log directories (maybe
> the .snapshot files?). The first time 0.11.0.0 (or newer) started up, it
> scanned the .log files in their entirety to create... something. It scanned
> *all* the segments, not just the most recent ones. I think that's why it
> took so long. I think normally log recovery only looks at the most recent
> segments.
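[Editorial aside: whether a broker's data directories contain those producer-state .snapshot files can be checked directly on disk. A minimal sketch, assuming a shell on the broker host; the `/data/kafka` path is only an example — point it at whatever your `log.dirs` is configured to.]

```shell
# Count .snapshot files under a Kafka data directory.
# The argument is whatever log.dirs points at on your broker (illustrative).
count_snapshots() {
  find "$1" -name '*.snapshot' | wc -l
}

# Tiny self-contained demo using a temp dir standing in for a partition dir;
# in practice you would run: count_snapshots /data/kafka
d=$(mktemp -d)
touch "$d/00000000000054942918.snapshot"
count_snapshots "$d"
rm -rf "$d"
```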
>
> We noticed this only on the FIRST boot when running 0.11+. From then on,
> startups were our normal length of time.
>
> In your https://pastebin.com/tZqze4Ya, I see a line like:
> [2018-01-05 13:43:34,776] INFO Completed load of log webapi-event-1 with 2
> log segments, log start offset 1068104 and log end offset 1236587 in 9547
> ms (kafka.log.Log)
>
> That line says that that partition took 9547ms (9.5 seconds) to
> load/recover. We had large partitions that took 30 minutes to recover, on
> that first boot. When I used strace to see what I/O the broker was doing,
> it was reading ALL the segments for the partitions.
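[Editorial aside: the "Completed load of log ... in N ms" lines quoted above make it easy to rank partitions by recovery time. A sketch, assuming that log line format and a shell on the broker host; the server.log path in the comment is illustrative.]

```shell
# Rank partitions by load/recovery time, using the "Completed load of log
# <partition> ... in <N> ms" lines from the broker log on stdin.
rank_loads() {
  grep 'Completed load of log' \
    | sed -E 's/.*Completed load of log ([^ ]+) with.* in ([0-9]+) ms.*/\2 \1/' \
    | sort -rn | head -20
}

# In practice: rank_loads < /path/to/server.log
# Self-contained demo with the line quoted above:
printf '%s\n' \
  '[2018-01-05 13:43:34,776] INFO Completed load of log webapi-event-1 with 2 log segments, log start offset 1068104 and log end offset 1236587 in 9547 ms (kafka.log.Log)' \
  | rank_loads    # prints: 9547 webapi-event-1
```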
>
> -James
>
>
>
> > On Jan 11, 2018, at 10:56 AM, Vincent Rischmann <vincent@rischmann.fr> wrote:
> >
> > If anyone else has any idea, I'd love to hear it.
> >
> > Meanwhile, I'll resume upgrading my brokers and hope it doesn't crash
> > and/or take so much time for recovery.
> >
> > On Sat, Jan 6, 2018, at 7:25 PM, Vincent Rischmann wrote:
> >> Hi,
> >>
> >> just to clarify: this is the cause of the crash
> >> https://pastebin.com/GuF60kvF in the broker logs, which is why I
> >> referenced https://issues.apache.org/jira/browse/KAFKA-4523
> >>
> >> I had this crash some time ago, and yesterday I was in the process of
> >> upgrading my brokers to 0.11.0.2 in part to address this bug;
> >> unfortunately, while stopping this particular broker it crashed.
> >>
> >> What I don't understand is why the recovery time after upgrading was so
> >> high. A couple of months ago, when a broker crashed due to this bug and
> >> recovered, it didn't take nearly as long. In fact, I have never had a
> >> recovery that long on any broker, even when they suffered a kernel panic
> >> or power failure.
> >>
> >> We have quite a bit of data; however, as I said, with 0.10.1.1 it "only"
> >> took around 45 minutes. The broker did do a lot of I/O while recovering
> >> (to the point where even just running ls was painfully slow), but as far
> >> as I remember it was the same every time a broker recovered. The volume
> >> of data hasn't changed much since the last crash with 0.10.1.1; in fact,
> >> I removed a lot of data recently.
> >>
> >> Our brokers are running with 3 SATA disks in RAID 0 (using mdadm); while
> >> recovering yesterday, atop reported around 200MB/s of read throughput.
> >>
> >> Here are some graphs from our monitoring:
> >>
> >> - CPU usage https://vrischmann.me/files/fr3/cpu.png
> >> - Disk IO https://vrischmann.me/files/fr3/disk_io.png and
> >> https://vrischmann.me/files/fr3/disk_usage.png
> >>
> >> On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> >>> Ismael:
> >>> We're on the same page.
> >>>
> >>> 0.11.0.2 was released on 17 Nov 2017.
> >>>
> >>> By 'recently' in my previous email I meant the change was newer.
> >>>
> >>> Vincent:
> >>> Did the machine your broker ran on experience a power issue?
> >>>
> >>> Cheers
> >>>
> >>> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ismael@juma.me.uk> wrote:
> >>>
> >>>> Hi Ted,
> >>>>
> >>>> The change you mention is not part of 0.11.0.2.
> >>>>
> >>>> Ismael
> >>>>
> >>>> On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>
> >>>>> bq. WARN Found a corrupted index file due to requirement failed: Corrupt
> >>>>> index found, index file
> >>>>> (/data/kafka/data-processed-15/00000000000054942918.index)
> >>>>>
> >>>>> Can you search backward for 00000000000054942918.index in the log to see
> >>>>> if we can find the cause of the corruption?
> >>>>>
> >>>>> This part of the code was recently changed by:
> >>>>>
> >>>>> KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> >>>>> recovery
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vincent@rischmann.fr>
> >>>>> wrote:
> >>>>>
> >>>>>> Here's an excerpt just after the broker started:
> >>>>>> https://pastebin.com/tZqze4Ya
> >>>>>>
> >>>>>> After more than 8 hours of recovery the broker finally started. I
> >>>>>> haven't read through all 8 hours of logs, but the parts I looked at
> >>>>>> are like the pastebin.
> >>>>>>
> >>>>>> I'm not seeing much in the log cleaner logs either; they look normal.
> >>>>>> We have a couple of compacted topics, but it seems only the consumer
> >>>>>> offsets topic is ever compacted (the other topics don't have much
> >>>>>> traffic).
> >>>>>>
> >>>>>> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> >>>>>>> What do the broker logs say it's doing during all that time?
> >>>>>>>
> >>>>>>> There are some consumer offset / log cleaner bugs which caused us
> >>>>>>> similarly long delays. That was easily visible by watching the log
> >>>>>>> cleaner activity in the logs, and in our monitoring of partition
> >>>>>>> sizes watching them go down, along with IO activity on the host for
> >>>>>>> those files.
> >>>>>>>
> >>>>>>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vincent@rischmann.fr>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this
> >>>>>>>> bug: https://issues.apache.org/jira/browse/KAFKA-4523
> >>>>>>>> Unfortunately, while stopping one broker, it crashed exactly because
> >>>>>>>> of this bug. No big deal usually, except after restarting Kafka on
> >>>>>>>> 0.11.0.2 the recovery is taking a really long time.
> >>>>>>>> I have around 6TB of data on that broker, and before, when it
> >>>>>>>> crashed, it usually took around 30 to 45 minutes to recover, but now
> >>>>>>>> I'm at almost 5h since Kafka started and it's still not recovered.
> >>>>>>>> I'm wondering what could have changed to have such a dramatic effect
> >>>>>>>> on recovery time? Is there maybe something I can tweak to try to
> >>>>>>>> reduce the time?
> >>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
>
>
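[Editorial note on the "something I can tweak" question in the original message: startup log recovery parallelism is controlled by the broker setting num.recovery.threads.per.data.dir, which defaults to 1, so recovery of a large data directory runs single-threaded out of the box. A sketch of the server.properties change; the value 8 is only illustrative and should be sized to the machine's disks.]

```properties
# server.properties: threads used per data directory for log recovery at
# startup and flushing at shutdown. Default is 1; 8 here is illustrative.
num.recovery.threads.per.data.dir=8
```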
