kafka-users mailing list archives

From Evan Huus <evan.h...@shopify.com>
Subject Re: Fetch Request Purgatory and Mirrormaker
Date Tue, 14 Apr 2015 13:55:11 GMT
Any ideas on this? It's still occurring...

Is there a separate mailing list or project for mirrormaker that I could
ask?

Thanks,
Evan

On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.huus@shopify.com> wrote:

> Hey Folks, we're running into an odd issue with mirrormaker and the fetch
> request purgatory on the brokers. Our setup consists of two six-node
> clusters (all running 0.8.2.1 on identical hardware with the same config). All
> "normal" producing and consuming happens on cluster A. Mirrormaker has been
> set up to copy all topics (except a tiny blacklist) from cluster A to
> cluster B.
>
> Cluster A is completely healthy at the moment. Cluster B is not, which is
> very odd since it is literally handling the exact same traffic.
>
> The graph for Fetch Request Purgatory Size looks like this:
> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
>
> Every time the purgatory shrinks, the latency from that causes one or more
> nodes to drop their leadership (it quickly recovers). We could probably
> alleviate the symptoms by decreasing
> `fetch.purgatory.purge.interval.requests` (it is currently at the default
> value) but I'd rather try and understand/solve the root cause here.
>
> Cluster B is handling no outside fetch requests, and turning mirrormaker
> off "fixes" the problem, so clearly (since mirrormaker is producing to this
> cluster, not consuming from it) the fetch requests must be coming from
> internal replication. However, the same data is being replicated when it is
> originally produced in cluster A, and the fetch purgatory size sits stably
> at ~10k there. There is nothing unusual in the logs on either cluster.
>
> I have read all the wiki pages and jira tickets I can find about the new
> purgatory design in 0.8.2 but nothing stands out as applicable. I'm happy
> to provide more detailed logs, configuration, etc. if anyone thinks there
> might be something important in there. I am completely baffled as to what
> could be causing this.
>
> Any suggestions would be appreciated. I'm starting to think at this point
> that we've completely misunderstood or misconfigured *something*.
>
> Thanks,
> Evan
>
