kafka-users mailing list archives

From Marcus Bengtsson <marcus.bengts...@cactusrail.se>
Subject Detection of lost kafka node
Date Fri, 05 May 2017 11:17:06 GMT

Hi all!

We are running Kafka in a 3-node setup with Kafka and Zookeeper on each node. Each topic has
1 partition and 2 replicas, for example:

Topic:someTopic    PartitionCount:1    ReplicationFactor:2    Configs:retention.ms=600000
    Topic: someTopic    Partition: 0    Leader: 2    Replicas: 2,0    Isr: 2,0
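
For anyone who wants to watch the leader and ISR while reproducing this, the partition line above can be parsed with a few lines of Python. This is only a sketch: the function name `parse_partition_line` is made up for illustration, and it assumes the describe output keeps the `Key: value` layout shown above with numeric broker ids.

```python
import re


def parse_partition_line(line):
    """Parse one partition line of the describe output into a dict.

    Assumes 'Key: value' pairs separated by whitespace, as in the
    output quoted above, and numeric broker ids.
    """
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(b) for b in fields["Replicas"].split(",")],
        "isr": [int(b) for b in fields["Isr"].split(",")],
    }


def out_of_sync(info):
    """Return the replicas that are currently not in the ISR."""
    return sorted(set(info["replicas"]) - set(info["isr"]))


line = "    Topic: someTopic    Partition: 0    Leader: 2    Replicas: 2,0    Isr: 2,0"
info = parse_partition_line(line)
print(info["leader"], info["isr"], out_of_sync(info))
```

Running the describe command repeatedly during the test and feeding each partition line through this would show when the leader actually changes and when the lost broker drops out of the ISR.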

We use the following settings:

Consumer settings:
fetch.min.bytes=1
enable.auto.commit=true
max.partition.fetch.bytes=1073741824

Producer settings:
metadata.fetch.timeout.ms=1000

If we stop Kafka and Zookeeper on one node with 'kill -9', Kafka detects that the leader is
missing within seconds and switches leader to the other replica and consumers will continue
to receive messages.

If, on the other hand, we bring down the network for the same node with 'ifdown eth0' (which
breaks the connection to both Kafka and Zookeeper on that node), Kafka seems to have
problems detecting that the broker is missing, and it takes up to 2 minutes until any more
messages can be consumed on the affected topics.
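
To put a number on the outage rather than estimating it by eye, one option is to record the arrival time of every consumed message and look for the longest gap. The sketch below is only an illustration: `longest_gap` is a hypothetical helper, and the timestamps in the example are invented values (in seconds), not measurements from our system.

```python
def longest_gap(timestamps):
    """Return (gap, start, end) for the largest gap between consecutive
    timestamps, or (0.0, None, None) if there are fewer than two."""
    ordered = sorted(timestamps)
    best = (0.0, None, None)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > best[0]:
            best = (cur - prev, prev, cur)
    return best


# Invented arrival times in seconds: steady traffic, a pause, then recovery.
arrivals = [0.0, 1.0, 2.0, 122.0, 123.0]
gap, start, end = longest_gap(arrivals)
print(f"longest gap: {gap:.1f}s between t={start} and t={end}")
```

Comparing the gap measured after 'kill -9' with the one measured after 'ifdown eth0' would make the difference between the two cases concrete.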

The following log can be seen on the consumer:
[2017-05-04 15:44:26,916] WARN Auto offset commit failed for group console-consumer-75510: Commit offsets failed with retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

and on the producer:
May 04 15:44:18: 15:44:18.420 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.435 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.440 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.442 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.444 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.446 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.448 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.449 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed
... (these errors continue to be printed for a while)
