kafka-users mailing list archives

From "shenguanghui@unionpay.com" <shenguang...@unionpay.com>
Subject partition get underreplicated and stuck, describe command shows the leader is a dead broker id
Date Tue, 19 Nov 2019 10:59:00 GMT

Kafka partitions become under-replicated, with a single replica in the ISR, and do not recover.
I have 8 brokers (broker ids 0 to 7) and several topics, each with 3 replicas. One day
broker 0 had a young GC pause of 3.29 seconds, and after that some partitions shrank their
ISR from 3 replicas to 1. The log shows:

[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: Shrinking
ISR from 0,1,2 to 0 (kafka.cluster.Partition)
[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: Shrinking ISR
from 0,1,2 to 0,1 (kafka.cluster.Partition)
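
To list which partitions dropped to a single in-sync replica, the "Shrinking ISR" lines can be scanned with a small script. This is only a rough sketch: the regular expression is written against the two log lines quoted above (with the mail line wrapping removed), and the function names are made up for illustration. Other Kafka versions may word this log message differently.

```python
import re

# Pattern written against the "Shrinking ISR" lines quoted in this message.
# Assumption: each log entry is on a single line (mail wrapping removed).
SHRINK_RE = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Shrinking ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)


def parse_isr_shrinks(lines):
    """Return one dict per 'Shrinking ISR' log line found in `lines`."""
    events = []
    for line in lines:
        m = SHRINK_RE.search(line)
        if m:
            events.append({
                "topic": m.group("topic"),
                "partition": int(m.group("partition")),
                "broker": int(m.group("broker")),
                "old_isr": [int(x) for x in m.group("old").split(",")],
                "new_isr": [int(x) for x in m.group("new").split(",")],
            })
    return events


def single_replica_partitions(events):
    """Partitions whose ISR shrank to exactly one replica."""
    return [(e["topic"], e["partition"]) for e in events if len(e["new_isr"]) == 1]


# The two log lines from this message, joined back onto single lines.
sample = [
    "[2019-11-08 13:35:00,821] INFO Partition [dcs_async_redis_to_db,7] on broker 0: "
    "Shrinking ISR from 0,1,2 to 0 (kafka.cluster.Partition)",
    "[2019-11-08 13:35:00,824] INFO Partition [__consumer_offsets,15] on broker 0: "
    "Shrinking ISR from 0,1,2 to 0,1 (kafka.cluster.Partition)",
]

events = parse_isr_shrinks(sample)
print(single_replica_partitions(events))  # [('dcs_async_redis_to_db', 7)]
```
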
There were many timeout exceptions on the producers during the GC pause. After a while, the
other 7 brokers consistently logged the following:

[2019-11-08 13:35:24,241] WARN [ReplicaFetcherThread-0-0]: Error in fetch to broker 0, request
(type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760,
fetchData={__consumer_offsets-7=(offset=44372693, logStartOffset=0, maxBytes=1048576),
__consumer_offsets-15=(offset=78350976, logStartOffset=0, maxBytes=1048576),
dcs_async_redis_to_db-7=(offset=758846267, logStartOffset=757998253, maxBytes=1048576)})
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:93)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:207)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:151)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

The log above is an excerpt from broker 1; the other brokers logged the same thing. Stranger
still, I tried to kill broker 0 normally but that failed, so I finally killed it with kill -9.
Even after broker 0 was killed, partition [dcs_async_redis_to_db,7] still showed broker 0 as
its leader when I described the topic from another broker with the --describe command. I am
sure that broker 0 had been killed by then. Finally, after I restarted broker 0, the cluster
returned to a correct state. There were some other incidents during the process, but I do
not think they are related to the problem I am confused about.
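
For anyone who wants to check for the same symptom, the --describe output can be compared against the set of brokers known to be alive. The following is a rough sketch only. It assumes the per-partition lines look like "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...", which may differ between Kafka versions. The sample text is illustrative and not copied from my cluster (the partition 3 line is entirely made up), and `find_stale_leaders` is a hypothetical helper name.

```python
import re

# Assumed shape of a per-partition line in the --describe output.
DESCRIBE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def find_stale_leaders(describe_output, live_brokers):
    """Return (topic, partition, leader) for partitions led by a broker not in live_brokers."""
    stale = []
    for line in describe_output.splitlines():
        m = DESCRIBE_RE.search(line)
        if m and int(m.group("leader")) not in live_brokers:
            stale.append((m.group("topic"), int(m.group("partition")), int(m.group("leader"))))
    return stale


# Illustrative output only: the first line mirrors the situation described above
# (leader 0, ISR 0); the second line is invented for contrast.
sample_output = (
    "\tTopic: dcs_async_redis_to_db\tPartition: 7\tLeader: 0\tReplicas: 0,1,2\tIsr: 0\n"
    "\tTopic: dcs_async_redis_to_db\tPartition: 3\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
)

# Broker 0 has been killed, so only brokers 1 to 7 are alive.
live = set(range(1, 8))
print(find_stale_leaders(sample_output, live))  # [('dcs_async_redis_to_db', 7, 0)]
```
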

I searched the Kafka issue tracker; some related issues are:

Issue 4477 is marked as fixed, but I cannot find the related commit log, code, or patch.
I would be grateful for your help. I have the Kafka logs for the whole period if you need them.

China UnionPay, Technology Division, Cloud QuickPass (云闪付) Team
Tel: 20633284 | 13696519872
China UnionPay Campus, 1699 Gutang Road, Pudong New Area, Shanghai
