kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: Kafka Connect behaves weird in case of zombie Kafka brokers. Also, zombie brokers?
Date Mon, 10 Apr 2017 02:25:01 GMT
Is that node the only bootstrap broker provided? If the Connect worker was
pinned to *only* that broker, it wouldn't have any chance of recovering
correct cluster information from the healthy brokers.

It sounds like there was a separate problem as well (the broker should have
figured out it was in a bad state wrt ZK), but normally we would expect
Connect to detect the issue, mark the coordinator dead (as it did) and then
refresh metadata to figure out which broker it should be talking to now.
There are some known edge cases around how initial cluster discovery works
which *might* be able to get you stuck in this situation.
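
To reduce exposure to that edge case, the worker's `bootstrap.servers` can list several brokers so initial discovery does not depend on any single node. A minimal sketch of the relevant worker config (host names here are placeholders, not from the original thread):

```properties
# connect-distributed.properties (hypothetical hosts)
# List multiple brokers so the initial metadata fetch can fall back to a
# healthy node; after bootstrap, the client uses the full broker list
# returned in cluster metadata.
bootstrap.servers=broker1.example.com:9092,broker2.example.com:9092,broker3.example.com:9092
```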

-Ewen

On Tue, Mar 21, 2017 at 10:43 PM, Anish Mashankar <anish@systeminsights.com>
wrote:

> Hello everyone,
> We are running a 5-broker Kafka v0.10.0.0 cluster on AWS. The Connect
> API is also at v0.10.0.0.
> It was observed that the distributed Kafka connector went into an
> infinite loop, repeatedly logging:
>
> (Re-)joining group connect-connect-elasticsearch-indexer.
>
> After a little more digging, we found another infinite loop with this
> pair of log messages:
>
> Discovered coordinator 1.2.3.4:9092 (id: ____ rack: null) for group x
>
> Marking the coordinator 1.2.3.4:9092 (id: ____ rack: null) dead for group x
>
> Restarting Kafka connect did not help.
>
> Looking at ZooKeeper, we realized that broker 1.2.3.4 had left the
> ZooKeeper cluster. This had happened due to a timeout when interacting
> with ZooKeeper. That broker was also the controller. Controller
> failover succeeded and the applications were fine, but a few days
> later we started facing the issue described above. To add to the
> surprise, the Kafka daemon was still running on the host but was no
> longer attempting to contact ZooKeeper at all. Hence, a zombie broker.
>
> Also, a Connect cluster spread across multiple hosts was not working;
> however, a single connector worked.
>
> After replacing the EC2 instance for broker 1.2.3.4 (keeping the same
> broker ID), the Kafka Connect cluster started working fine.
>
> I am not sure if this is a bug. Kafka Connect shouldn't keep retrying
> the same broker if it cannot establish a connection. We use Consul for
> service discovery. Because the broker process was running and port
> 9092 was open, Consul reported it as healthy and kept resolving DNS
> queries to that broker whenever the connector tried to connect. We
> have disabled DNS caching in the Java options, so Kafka Connect should
> have tried a different host on each lookup, but instead it only ever
> tried broker 1.2.3.4.
>
> Does Kafka Connect internally cache DNS? Is there a debugging step I
> am missing here?
> --
> Anish Samir Mashankar
>
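
On the DNS question above: the Kafka Java clients do not add their own DNS cache; name resolution caching happens inside the JVM via `InetAddress` and is governed by the `networkaddress.cache.ttl` security property. A minimal sketch of how those properties are set programmatically (the TTL values here are illustrative, not recommendations):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // These JVM-wide security properties control InetAddress result
        // caching; they must be set before the first lookup to take effect.
        Security.setProperty("networkaddress.cache.ttl", "30");          // successful lookups: 30s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups: 10s

        System.out.println(Security.getProperty("networkaddress.cache.ttl")); // prints "30"
    }
}
```

Note that without a security manager, most JVMs default to caching successful lookups for roughly 30 seconds, so "disabled DNS caching in the Java options" is worth double-checking against the actual property names in use.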
