kafka-users mailing list archives

From Andrey Falko <afa...@salesforce.com>
Subject Unrecoverable kafka cluster after failed rolling restart
Date Thu, 18 Jan 2018 04:01:20 GMT
Hi everyone,

Fortunately this didn't happen with a production cluster, but here is
the sequence of events with this 10 broker cluster:
1) Stop all Kafka brokers (this was a mistake on my part; I
intended to do a rolling restart)
2) Start 5 brokers a few minutes apart each
3) Start the rest of the brokers at roughly the same time
4) While watching my metrics, leader election started happening,
zookeepers started getting busier; then there was a drop-off in zk
activity
5) At that point, kafkas remained busy. Describe topics showed that
all 62k 3x replicated partitions had leaders, all had their 3
replicas, but 45k had their full ISR, 14k had 2 ISR, 3k 1 ISR.
6) After waiting for 30 minutes, the ISR numbers didn't change.
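
The ISR breakdown in step 5 can be tallied with a small script. This is only a sketch: it assumes the describe output has been saved as text and that each partition line carries "Partition:" and "Isr:" fields, as kafka-topics.sh --describe prints them. The function name tally_isr_sizes and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the comma-separated broker ids after "Isr:" on a partition line.
ISR_RE = re.compile(r"Isr:\s*([\d,]*)")


def tally_isr_sizes(describe_lines):
    """Count partitions by the number of brokers in their ISR.

    describe_lines: iterable of text lines from the describe output.
    Returns a Counter mapping ISR size -> number of partitions.
    """
    counts = Counter()
    for line in describe_lines:
        # Skip topic summary lines; only per-partition lines have "Partition:".
        if "Partition:" not in line:
            continue
        match = ISR_RE.search(line)
        if match is None:
            continue
        brokers = [b for b in match.group(1).split(",") if b]
        counts[len(brokers)] += 1
    return counts


# Illustrative input only, not taken from the affected cluster.
sample = [
    "Topic:demo\tPartitionCount:3\tReplicationFactor:3\tConfigs:",
    "\tTopic: demo\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3",
    "\tTopic: demo\tPartition: 1\tLeader: 2\tReplicas: 2,3,4\tIsr: 2,3",
    "\tTopic: demo\tPartition: 2\tLeader: 3\tReplicas: 3,4,5\tIsr: 3",
]
isr_counts = tally_isr_sizes(sample)
```

Running the same tally a few minutes apart shows whether the under-replicated counts are shrinking or stuck.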

Zookeeper logs spew this:
[2018-01-18 03:18:56,645] INFO Accepted socket connection from
[2018-01-18 03:18:56,645] INFO Client attempting to renew session
0xab61063d85310002 at /
[2018-01-18 03:18:56,645] INFO Established session 0xab61063d85310002
with negotiated timeout 6000 for client /
[2018-01-18 03:18:56,647] WARN Exception causing close of session
0xab61063d85310002 due to java.io.IOException: Len error 1251065
[2018-01-18 03:18:56,647] INFO Closed socket connection for client
/ which had sessionid 0xab61063d85310002
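
One possible reading of the "Len error" line, offered as an assumption rather than a confirmed diagnosis: the ZooKeeper server raises this exception when an incoming packet is larger than its jute.maxbuffer setting, whose documented default is 0xfffff (1048575 bytes). The sketch below only compares the reported length against that default. The helper names are hypothetical, and it is not known whether this cluster overrides the setting.

```python
import re

# ZooKeeper's documented default for jute.maxbuffer (just under 1 MiB).
# Assumption: the cluster in question has not changed it.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF


def len_error_size(log_line):
    """Return the byte length reported in a 'Len error N' log line, or None."""
    match = re.search(r"Len error (\d+)", log_line)
    return int(match.group(1)) if match else None


def exceeds_limit(log_line, limit=DEFAULT_JUTE_MAXBUFFER):
    """True if the line reports a packet larger than the given limit."""
    size = len_error_size(log_line)
    return size is not None and size > limit


# The WARN line from the ZooKeeper log above, joined onto one line.
warn_line = (
    "[2018-01-18 03:18:56,647] WARN Exception causing close of session "
    "0xab61063d85310002 due to java.io.IOException: Len error 1251065"
)
over_by = len_error_size(warn_line) - DEFAULT_JUTE_MAXBUFFER
```

Under that assumption the rejected packet is about 200 KB over the default limit, which would explain why the session is closed right after it is re-established.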

Kafka logs spew this:
[2018-01-18 03:19:57,751] INFO Session establishment complete on
server ip-172-16-253-167.us-west-1.compute.internal/,
sessionid = 0xab61063d85310002, negotiated
timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-01-18 03:19:57,751] INFO zookeeper state changed (SyncConnected)
[2018-01-18 03:19:57,753] INFO Unable to read additional data from
server sessionid 0xab61063d85310002, likely server has closed socket,
closing socket connection and attempting
reconnect (org.apache.zookeeper.ClientCnxn)
[2018-01-18 03:19:57,854] INFO zookeeper state changed (Disconnected)
[2018-01-18 03:19:57,854] INFO Waiting for keeper state SyncConnected

The only thing that I found is the following link, but I'm running
kafka-1.0 bundled ZK, which is 3.4 and has that fix described in it:

Unfortunately, I am unable to consume or produce to any of the
existing topics. I was able to create new topics and produce to them
at reasonable rates with the environment in the bad state. Consuming
from these new topics didn't work.

How should I proceed from here? For some extra context: I'm
researching how many replicated partitions kafka can support before it
falls over and/or starts failing to meet various SLAs we've made with
our users. I'll try to "correct" the original mistake that I made,
however, I hope that an operator error like mine doesn't take out a
production cluster like this :).

Best regards,
Andrey Falko
