kafka-users mailing list archives

From David Garcia <dav...@spiceworks.com>
Subject Re: Slow machine disrupting the cluster
Date Fri, 16 Sep 2016 14:41:19 GMT
To remediate, you could start another broker, rebalance, and then shut down the busted broker.
But you really should put some monitoring on your system (to help diagnose the actual problem).
Datadog has a pretty good set of articles on using JMX to do this: https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

There are lots of JMX metrics-gathering tools too, such as jmxtrans: https://github.com/jmxtrans/jmxtrans

Confluent also offers tooling (such as Control Center) to help with monitoring.

As far as MirrorMaker goes, you can play with the consumer/producer timeout settings to make
sure the process waits long enough for a slow machine.
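
As a rough sketch of what "playing with the timeout settings" could look like, the
Python snippet below renders two override files for MirrorMaker. The property names
are standard Kafka client settings, but every value is an illustrative assumption
(not a recommendation), and the helper names are hypothetical. The sanity checks
reflect my understanding of the 0.10-era consumer: request.timeout.ms has to exceed
session.timeout.ms, and the heartbeat interval is usually kept at no more than a
third of the session timeout.

```python
# Sketch only: all values below are placeholder assumptions; tune them against
# how slow a broker you are willing to wait for.
CONSUMER_OVERRIDES = {
    # The broker-side group.max.session.timeout.ms must allow this value.
    "session.timeout.ms": 60000,
    "heartbeat.interval.ms": 20000,
    "request.timeout.ms": 70000,
}

PRODUCER_OVERRIDES = {
    "request.timeout.ms": 60000,
    "max.block.ms": 120000,
}


def validate_consumer(settings):
    """Raise ValueError if the consumer timeouts are mutually inconsistent."""
    session = settings["session.timeout.ms"]
    if settings["heartbeat.interval.ms"] * 3 > session:
        raise ValueError("heartbeat.interval.ms should be <= session.timeout.ms / 3")
    if settings["request.timeout.ms"] <= session:
        raise ValueError("request.timeout.ms must exceed session.timeout.ms")


def render_properties(settings):
    """Render a dict as the text of a Java .properties file, keys sorted."""
    return "".join(f"{key}={value}\n" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    validate_consumer(CONSUMER_OVERRIDES)
    print("# consumer overrides")
    print(render_properties(CONSUMER_OVERRIDES))
    print("# producer overrides")
    print(render_properties(PRODUCER_OVERRIDES))
```

The rendered lines would be added to the files passed to kafka-mirror-maker.sh via
--consumer.config and --producer.config. Longer timeouts only make the process more
patient; they do not fix the slow broker itself.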


On 9/16/16, 7:11 AM, "Gerard Klijs" <gerard.klijs@dizzit.com> wrote:

    We just had an interesting issue; luckily this was only on our test cluster.
    For some reason one of the machines in the cluster became really slow.
    Because it was still alive, it was still the leader for some
    topic-partitions. Our mirror maker reads and writes to multiple
    topic-partitions on each thread. When committing the offsets, this fails
    for the topic-partitions located on the slow machine, because the consumers
    have timed out. The data for these topic-partitions is then sent over and
    over, causing a flood of duplicate messages.
    What would be the best way to prevent this in the future? Is there some way
    the broker could notice it's performing poorly and shut itself off, for example?
