kafka-users mailing list archives

From Zoran <zoran.ljubi...@bulbtech.com>
Subject Group consumer cannot consume messages if kafka service on specific node in test cluster is down
Date Tue, 30 Jan 2018 13:59:54 GMT
Hi,

I have three servers:

blade1 (192.168.112.31),
blade2 (192.168.112.32) and
blade3 (192.168.112.33).

kafka_2.11-1.0.0 is installed on each of the servers.
ZooKeeper is also installed on blade3 (192.168.112.33:2181).

I created a topic repl3part5 with the following command:

bin/kafka-topics.sh --zookeeper 192.168.112.33:2181 --create \
  --replication-factor 3 --partitions 5 --topic repl3part5

When I describe the topic, it looks like this:

[root@blade1 kafka]# bin/kafka-topics.sh --describe --topic repl3part5 --zookeeper 192.168.112.33:2181

Topic:repl3part5    PartitionCount:5    ReplicationFactor:3 Configs:
     Topic: repl3part5    Partition: 0    Leader: 2    Replicas: 2,3,1    Isr: 2,3,1
     Topic: repl3part5    Partition: 1    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2
     Topic: repl3part5    Partition: 2    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3
     Topic: repl3part5    Partition: 3    Leader: 2    Replicas: 2,1,3    Isr: 2,1,3
     Topic: repl3part5    Partition: 4    Leader: 3    Replicas: 3,2,1    Isr: 3,2,1

I have a producer for this topic:

bin/kafka-console-producer.sh \
  --broker-list 192.168.112.31:9092,192.168.112.32:9092,192.168.112.33:9092 \
  --topic repl3part5

and single consumer:

bin/kafka-console-consumer.sh \
  --bootstrap-server 192.168.112.31:9092,192.168.112.32:9092,192.168.112.33:9092 \
  --topic repl3part5 --consumer-property group.id=zoran_1

Every message sent by the producer is received by the consumer. So far,
so good.

Now I would like to test failover of the Kafka servers. If I stop the
Kafka service on blade3, I get consumer warnings but all produced
messages are still consumed.

[2018-01-30 14:30:01,203] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 3 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:30:01,299] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 3 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:30:01,475] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 3 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)

Next I started the Kafka service on blade3 again and stopped the Kafka
service on blade2.
The consumer showed one warning, but all produced messages were still
consumed.

[2018-01-30 14:31:38,164] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)

Then I started the Kafka service on blade2 again and stopped the Kafka
service on blade1.

The consumer now shows warnings about node 1 and node 2147483646, and
also "Asynchronous auto-commit of offsets ... failed: Offset commit
failed with a retriable exception. You should retry committing offsets.
The underlying error was: null."

[2018-01-30 14:33:16,393] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:16,469] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:16,557] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:16,986] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:16,991] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:17,493] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:17,495] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:18,002] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:18,003] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Asynchronous auto-commit of offsets 
{repl3part5-4=OffsetAndMetadata{offset=18, metadata=''}, 
repl3part5-3=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-2=OffsetAndMetadata{offset=19, metadata=''}, 
repl3part5-1=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-0=OffsetAndMetadata{offset=20, metadata=''}} failed: Offset 
commit failed with a retriable exception. You should retry committing 
offsets. The underlying error was: null 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:33:18,611] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:18,932] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:18,933] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Asynchronous auto-commit of offsets 
{repl3part5-4=OffsetAndMetadata{offset=18, metadata=''}, 
repl3part5-3=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-2=OffsetAndMetadata{offset=19, metadata=''}, 
repl3part5-1=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-0=OffsetAndMetadata{offset=20, metadata=''}} failed: Offset 
commit failed with a retriable exception. You should retry committing 
offsets. The underlying error was: null 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:33:19,977] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 2147483646 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-01-30 14:33:19,978] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Asynchronous auto-commit of offsets 
{repl3part5-4=OffsetAndMetadata{offset=18, metadata=''}, 
repl3part5-3=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-2=OffsetAndMetadata{offset=19, metadata=''}, 
repl3part5-1=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-0=OffsetAndMetadata{offset=20, metadata=''}} failed: Offset 
commit failed with a retriable exception. You should retry committing 
offsets. The underlying error was: null 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:33:19,979] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Connection to node 1 could not be established. Broker 
may not be available. (org.apache.kafka.clients.NetworkClient)
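
A note on the unusual node id in these warnings, offered as a sketch
rather than a diagnosis: as far as I understand the Java client, it opens
a separate connection to the group coordinator and labels it with the id
Integer.MAX_VALUE minus the coordinator's broker id. Under that
assumption, the logged id can be decoded with plain shell arithmetic (the
variable names are only illustrative):

```shell
# Assumption: the Java consumer labels its group-coordinator connection
# with the id Integer.MAX_VALUE - <broker id>.
int_max=2147483647        # Java Integer.MAX_VALUE
logged_node=2147483646    # node id copied from the warnings above

# Undo the subtraction to recover the broker id behind that connection.
broker_id=$(( int_max - logged_node ))
echo "coordinator connection points at broker ${broker_id}"
```

If the assumption holds, the "node 2147483646" warnings and the "node 1"
warnings refer to the same broker, reached once as the group coordinator
and once as an ordinary broker.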

I tried to solve the problem by setting
offsets.topic.replication.factor=2 (or 3) in the server.properties file
on all three servers (one of the files is attached), but with no success.
My idea was that the __consumer_offsets topic wasn't replicated
throughout the cluster, but it looks like that is not the case here.
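
One way to check that idea, sketched under stated assumptions: as far as
I understand, offsets.topic.replication.factor is only applied when
__consumer_offsets is first created, so changing it later would not
re-replicate an existing topic. I am also assuming that the broker maps a
group to an offsets partition as (Java String.hashCode of the group id,
with the sign bit cleared) modulo offsets.topic.num.partitions, and that
the default of 50 partitions is unchanged. The bash snippet below computes
that partition for the group used here; the describe command in the
comment is the same tool used earlier, pointed at the internal topic.

```shell
#!/usr/bin/env bash
# Assumptions (not confirmed anywhere in this thread):
#   - group -> partition is (hashCode(group.id) & 0x7fffffff) % num_partitions
#   - offsets.topic.num.partitions is left at its default of 50
group_id="zoran_1"
num_partitions=50

# Reproduce Java's String.hashCode(): h = 31*h + char, wrapped to 32 bits.
hash=0
for (( i = 0; i < ${#group_id}; i++ )); do
  ch=$(printf '%d' "'${group_id:i:1}")      # numeric code of character i
  hash=$(( (hash * 31 + ch) & 0xFFFFFFFF ))
done

# Clear the sign bit, then take the remainder.
partition=$(( (hash & 0x7FFFFFFF) % num_partitions ))
echo "group ${group_id} -> __consumer_offsets partition ${partition}"

# To see the leader and replicas of that partition on a live cluster:
#   bin/kafka-topics.sh --describe --topic __consumer_offsets \
#     --zookeeper 192.168.112.33:2181
```

If the replica list of that partition contains only one broker, then
stopping that broker would leave the group without a coordinator, whatever
the replication of repl3part5 itself looks like.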

While the Kafka service on blade1 was down, describing the topic showed the following:

[root@blade1 kafka]# bin/kafka-topics.sh --describe --topic repl3part5 --zookeeper 192.168.112.33:2181

Topic:repl3part5    PartitionCount:5    ReplicationFactor:3 Configs:
     Topic: repl3part5    Partition: 0    Leader: 3    Replicas: 2,3,1    Isr: 3
     Topic: repl3part5    Partition: 1    Leader: 3    Replicas: 3,1,2    Isr: 3
     Topic: repl3part5    Partition: 2    Leader: 3    Replicas: 1,2,3    Isr: 3
     Topic: repl3part5    Partition: 3    Leader: 3    Replicas: 2,1,3    Isr: 3
     Topic: repl3part5    Partition: 4    Leader: 3    Replicas: 3,2,1    Isr: 3

The producer now shows the following warning. It still puts messages on
the topic, but they only increase the consumer lag on the partitions:

[2018-01-30 14:37:21,816] WARN [Producer clientId=console-producer] 
Connection to node 1 could not be established. Broker may not be 
available. (org.apache.kafka.clients.NetworkClient)

I noticed that while the Kafka service on blade1 is alive, I can stop and
start blade2 and blade3 in any combination and the consumer will always
be able to consume messages.
If the Kafka service on blade1 is down, then even if the Kafka services
on blade2 and blade3 are up and running, the consumer cannot consume
messages.

After bringing the Kafka service on blade1 back up, all messages that the
producer sent while it was down are replayed, and then the following is
shown in the consumer terminal:

[2018-01-30 14:44:30,817] ERROR [Consumer clientId=consumer-1, 
groupId=zoran_1] Offset commit failed on partition repl3part5-4 at 
offset 20: This is not the correct coordinator. 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:44:30,817] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Asynchronous auto-commit of offsets 
{repl3part5-4=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-3=OffsetAndMetadata{offset=22, metadata=''}, 
repl3part5-2=OffsetAndMetadata{offset=20, metadata=''}, 
repl3part5-1=OffsetAndMetadata{offset=22, metadata=''}, 
repl3part5-0=OffsetAndMetadata{offset=22, metadata=''}} failed: Offset 
commit failed with a retriable exception. You should retry committing 
offsets. The underlying error was: This is not the correct coordinator. 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:44:31,202] ERROR [Consumer clientId=consumer-1, 
groupId=zoran_1] Offset commit failed on partition repl3part5-4 at 
offset 22: This is not the correct coordinator. 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2018-01-30 14:44:31,202] WARN [Consumer clientId=consumer-1, 
groupId=zoran_1] Asynchronous auto-commit of offsets 
{repl3part5-4=OffsetAndMetadata{offset=22, metadata=''}, 
repl3part5-3=OffsetAndMetadata{offset=24, metadata=''}, 
repl3part5-2=OffsetAndMetadata{offset=22, metadata=''}, 
repl3part5-1=OffsetAndMetadata{offset=24, metadata=''}, 
repl3part5-0=OffsetAndMetadata{offset=24, metadata=''}} failed: Offset 
commit failed with a retriable exception. You should retry committing 
offsets. The underlying error was: This is not the correct coordinator. 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

From now on, everything works with no problems or warnings and the
system is fully functional.

Can someone explain to me why the Kafka server on blade1 is so important,
and what my options are in order to be able to stop any of the two
servers (including the Kafka server on blade1) and still consume
messages with no delay?
This thing drives me crazy. :)

Can you please help?

Regards.
