kafka-users mailing list archives

From Brett Rann <br...@zendesk.com.INVALID>
Subject Re: Unavailable partitions after upgrade to kafka 1.0.0
Date Mon, 23 Apr 2018 07:25:54 GMT
Firstly, 1.0.1 is out and I'd strongly advise you to use that as the
upgrade path over 1.0.0 if you can, because it contains a lot of bugfixes,
some of them critical.

With unclean leader elections enabled, it should have resolved itself once
the affected broker came back online and all partitions were available
again, so there was probably an issue there.

Personally I had a lot of struggles upgrading off of 0.10 with bugged,
large consumer offset partitions (10s and 100s of GBs) that had stopped
compacting and should have been in the MBs. The largest ones took 45
minutes to compact, which stretched out the rolling upgrade time
significantly. Also, occasionally even with a clean shutdown there was
corruption detected on broker start and it took time for the repair -- a
/lot/ of time. In both cases it was easily seen in the logs, in
significantly increased disk IO metrics on boot, and in FD-use metrics
gradually returning to previous levels.

Was it all on the one broker, or across multiple? Did you follow the
rolling upgrade procedure? At what point in the rolling process did the
first issue appear?
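When rolling, a sanity check between broker restarts helps confirm
replication has caught up before you take down the next broker; the
describe filters on kafka-topics.sh cover it (the ZooKeeper address below
is a placeholder):

```shell
# Between each broker restart, confirm both of these print nothing
# before moving on (zk1:2181 is a placeholder ZooKeeper address).
kafka/bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
  --under-replicated-partitions
kafka/bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
  --unavailable-partitions
```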

https://kafka.apache.org/10/documentation/#upgrade  (that's for 1.0.x)
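The upgrade doc linked above boils down to pinning the protocol and
message-format versions before swapping binaries, then bumping them in a
second rolling restart. A sketch of the relevant server.properties lines,
assuming a 0.11.0 starting point (substitute whatever version you are
actually running):

```
# Phase 1: set on every broker BEFORE upgrading the binaries, so
# upgraded brokers keep speaking the old protocol during the roll.
inter.broker.protocol.version=0.11.0
log.message.format.version=0.11.0

# Phase 2: once all brokers are on 1.0 and verified healthy, bump the
# protocol version and do one more rolling restart; bump the message
# format only after clients are also upgraded.
inter.broker.protocol.version=1.0
log.message.format.version=1.0
```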

On Mon, Apr 23, 2018 at 4:04 PM, Mika Linnanoja <mika.linnanoja@rovio.com>
wrote:
> Hello,
> Last week I upgraded one relatively large kafka cluster (EC2, 10 brokers,
> ~30 TB data, 100-300 Mbps in/out per instance) to 1.0, and saw some
> issues.
> Out of ~100 topics with 2..20 partitions each, 9 partitions in 8 topics
> became "unavailable" across 3 brokers. The leader was shown as -1 and ISR
> was empty. A Java service using the Kafka clients was unable to send any
> data to these partitions, so it got dropped.
> The partitions were shown on the `kafka/bin/kafka-topics.sh --zookeeper
> <zk's> --unavailable-partitions --describe` output. Nothing special about
> these partitions, among them were big ones (hundreds of gigs) and tiny ones
> (megabytes).
> The fix was to enable unclean leader elections and restart one of the
> affected brokers for each partition: `kafka/bin/kafka-configs.sh --zookeeper
> <zk's> --entity-type topics --entity-name <topicname> --add-config
> unclean.leader.election.enable=true --alter`.
> Has anyone seen something like this, and how might we avoid it during the
> next upgrade? Maybe it would be better if said cluster got no traffic
> during the upgrade, but we cannot have a maintenance break as everything
> is up 24/7.
> Cluster is for analytics data, some of which is consumed in real-time
> applications, mostly by secor.
> BR,
> Mika
> --
> *Mika Linnanoja*
> Senior Cloud Engineer
> Games Technology
> Rovio Entertainment Corp
> Keilaranta 7, FIN-02150 Espoo, Finland
> mika.linnanoja@rovio.com
> www.rovio.com


Brett Rann

Senior DevOps Engineer

Zendesk International Ltd

395 Collins Street, Melbourne VIC 3000 Australia

Mobile: +61 (0) 418 826 017
