kafka-users mailing list archives

From Thomas Falkenberg <thomas.falkenb...@payback.net>
Subject kafka in undefined state after zookeeper network issues
Date Fri, 04 Nov 2016 16:44:36 GMT
Hi everyone,

Our setup: 3-node Kafka (0.9.0.1) / 3-node ZooKeeper (3.4.6-1569965) on 6 separate
VMware instances. We have around 10 topics with 32 partitions. Offsets are managed by Kafka.


We had a production issue that we suspect was caused by VMware snapshot backups leading
to read timeouts on the ZooKeeper followers, causing a re-election. The ZooKeeper leader changed
from zk2 to zk3; zk2 came back and rejoined, then logged another timeout but rejoined quickly
again. So far so good, I would call this expected behavior.
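
For the record, the role and basic health of each ZooKeeper node can be checked with the
four-letter-word commands; zk2 and the client port 17001 below are from my test setup:
echo srvr | nc zk2 17001 | grep Mode    # prints "Mode: leader" or "Mode: follower"
echo ruok | nc zk2 17001                # prints "imok" if the node is serving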

Unfortunately our (standard Kafka Java) producers had an issue afterwards, specifically with
kafka2. It reported errors in its log such as "Number of insync replicas for partition [<partition>,29]
is [1], below required minimum [2]", which caused the publish attempts to fail with the same
message and, in turn, end-user requests to fail. I suspect this happened only for certain
partitions (those where kafka2 was supposed to be the leader). Messages could be processed fine
by kafka1 and kafka3; they just shrank their ISR from 1,2,3 (or similar) to 1,3. So for
some reason messages were still being routed to kafka2, which had only 1 replica in its
ISR and thus declined to write them. How could this happen?
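
For reference, this is roughly how I would check the relevant settings and ISR state
(<zk_host:port> and <topic> are placeholders; min.insync.replicas=2 is obviously set
somewhere, either broker-wide or as a topic override):
# broker-wide setting
grep min.insync.replicas config/server.properties
# per-topic override, if any
./kafka-configs.sh --zookeeper <zk_host:port> --describe --entity-type topics --entity-name <topic>
# partitions whose ISR is currently smaller than the replica list
./kafka-topics.sh --zookeeper <zk_host:port> --describe --under-replicated-partitions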

After a restart of all ZooKeeper and Kafka instances (done quickly, in order to restore production
asap) everything worked fine again. The only information we have is the log files; we don't
have the output of a "--describe" of the topics or anything like that. I can provide the logs,
but I would have to upload them somewhere and strip out sensitive data.

I then tried to reproduce the behavior locally with a similar set of 6 VMs and identical config.
I created a test topic with 32 partitions and started up a producer and a consumer.
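
The topic was created roughly like this (replication factor 3; exact flags from memory):
./kafka-topics.sh --zookeeper 192.168.104.3:17001 --create --topic test32 \
    --partitions 32 --replication-factor 3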

The producer is running this way:
while true; do date; sleep 1; done | ./kafka-run-class.sh kafka.tools.ConsoleProducer \
    --broker-list 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
    --topic test32 --request-required-acks -1

The consumer:
./kafka-run-class.sh kafka.tools.ConsoleConsumer --new-consumer \
    --bootstrap-server 192.168.104.6:17111,192.168.104.7:17111,192.168.104.8:17111 \
    --topic test32 --from-beginning

Then I interfered with the network on the ZooKeeper leader using iptables, causing re-elections
at random intervals (see this script: https://thepasteb.in/p/0ghDcWxKQq9g8uz). I ran the script
like so:
while true; do ./iptables_block_zk_leader.sh block; sleep $((RANDOM%10)); \
    ./iptables_block_zk_leader.sh unblock; done
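
In case the paste link goes away: the script just toggles iptables DROP rules on the current
ZooKeeper leader. A rough sketch (not the exact script; the peer/election ports 2888/3888 are
the ZooKeeper defaults, not necessarily the ones we use):
#!/bin/bash
# block/unblock ZooKeeper peer traffic on this node (run on the current leader)
case "$1" in
  block)
    iptables -A INPUT  -p tcp -m multiport --dports 2888,3888 -j DROP
    iptables -A OUTPUT -p tcp -m multiport --dports 2888,3888 -j DROP
    ;;
  unblock)
    iptables -D INPUT  -p tcp -m multiport --dports 2888,3888 -j DROP
    iptables -D OUTPUT -p tcp -m multiport --dports 2888,3888 -j DROP
    ;;
esac
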
It ran overnight, and the next day Kafka was in a somewhat broken state (I stopped the above
script to analyze):

1. Only kafka2 is registered with ZooKeeper (ls /brokers/ids -> [2]). The ISR only shows kafka1
and kafka2; kafka3 seems to be out of the game.
./kafka-topics.sh --zookeeper 192.168.104.3:17001 --topic test32 --describe:
Topic:test32    PartitionCount:32       ReplicationFactor:3     Configs:
        Topic: test32   Partition: 0    Leader: 2       Replicas: 3,1,2 Isr: 2,1
        Topic: test32   Partition: 1    Leader: 2       Replicas: 1,2,3 Isr: 2,1 
	  [...]


2. Kafka1 is constantly logging:
[2016-11-04 17:14:51,928] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion
[120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR
for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
[2016-11-04 17:15:01,927] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR
for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
[2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion
[120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-11-04 17:15:01,930] INFO Partition [__consumer_offsets,19] on broker 1: Cached zkVersion
[120] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR
for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
[2016-11-04 17:15:11,926] INFO Partition [__consumer_offsets,19] on broker 1: Shrinking ISR
for partition [__consumer_offsets,19] from 2,1,3 to 1,3 (kafka.cluster.Partition)
3. Kafka2 is not logging any issues. 
4. Kafka3 is constantly logging:
[2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition
[test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
[2016-11-04 14:34:13,246] INFO Partition [test32,30] on broker 3: Shrinking ISR for partition
[test32,30] from 3,1,2 to 3 (kafka.cluster.Partition)
[2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-11-04 14:34:13,252] INFO Partition [test32,30] on broker 3: Cached zkVersion [97] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13]
to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [test32,13]
to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47]
to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2016-11-04 14:34:14,109] ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,47]
to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
5. ZooKeeper is in a stable state as far as I can tell; I see no errors in the logs. ruok
returns imok on all 3 nodes, and the zxid is equal on all of them.

6. Producer and consumer are still running fine!
There were occasional errors overnight like:
...
[2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33230 on topic-partition
test32-5, retrying (1 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 02:43:04,620] ERROR Error when sending message to topic test32 with key: null,
value: 28 bytes with error: The server disconnected before a response was received. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
[2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition
test32-20, retrying (0 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 02:43:04,620] WARN Got error produce response with correlation id 33229 on topic-partition
test32-14, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
...

But if the producer is pointed only at the "broken" kafka3 (--broker-list 192.168.104.8:17111),
it logs the following warnings:
[kafka_xx@kafka2 /apps/kafka/logs/kafka]$ while true; do date; sleep 1; done | \
    /apps/kafka/kafka/bin/kafka-run-class.sh kafka.tools.ConsoleProducer \
    --broker-list 192.168.104.8:17111 --topic test32 --request-required-acks -1
[2016-11-04 14:24:28,708] WARN Got error produce response with correlation id 2 on topic-partition
test32-19, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-27, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-12, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-28, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-18, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-3, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)
[2016-11-04 14:24:30,204] WARN Got error produce response with correlation id 1 on topic-partition
test32-9, retrying (2 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender)

Still, the messages are being produced and can be consumed by the console consumer, so I
am a bit puzzled by this state. Maybe I have reproduced the issue this way, but the Java producer
we have in our app behaved differently and cached something (stale metadata, perhaps), so messages
still got routed to the "sick" Kafka node.


Any help would be much appreciated!

Thanks
Thomas
