kafka-users mailing list archives

From Evan Huus <evan.h...@shopify.com>
Subject Re: Fetch Request Purgatory and Mirrormaker
Date Wed, 15 Apr 2015 00:40:52 GMT
On Tue, Apr 14, 2015 at 8:31 PM, Jiangjie Qin <jqin@linkedin.com.invalid>
wrote:

> Hey Evan,
>
> Is this issue only observed when mirror maker is consuming? It looks like
> for Cluster A you have some other consumers.
> Do you mean that if you stop mirror maker, the problem goes away?
>

Yes, exactly. The setup is A -> Mirrormaker -> B, so mirrormaker is
consuming from A and producing to B.
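
For reference, we launch mirrormaker with the stock tool, roughly like
this (a sketch; the property-file names and the blacklist regex are
placeholders rather than our exact setup, flags from memory for 0.8.2.1):

    bin/kafka-run-class.sh kafka.tools.MirrorMaker \
        --consumer.config consumer-cluster-a.properties \
        --producer.config producer-cluster-b.properties \
        --blacklist 'some-internal-topics-.*' \
        --num.streams 4

where the consumer config points zookeeper.connect at cluster A and the
producer config points at cluster B's brokers.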

Cluster A is always fine. Cluster B is fine when mirrormaker is stopped.
Cluster B has the weird purgatory issue when mirrormaker is running.

Today I rolled out a change to reduce the
`fetch.purgatory.purge.interval.requests` and
`producer.purgatory.purge.interval.requests` configuration values on
cluster B from 1000 to 200, but it had no effect, which I find really weird.
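
Concretely, the change was these two lines in server.properties on each
cluster B broker (both were previously at 1000), rolled out with a broker
restart since these are not dynamic settings:

    fetch.purgatory.purge.interval.requests=200
    producer.purgatory.purge.interval.requests=200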

Thanks,
Evan
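
P.S. One more data point for anyone following along: since the fetch
requests on cluster B can only be coming from follower replication, the
settings that control how long each of those requests sits in purgatory
are the replica fetcher knobs in server.properties, i.e.:

    # a follower fetch parks in purgatory until this much time passes...
    replica.fetch.wait.max.ms=500
    # ...or until at least this many bytes are available to return
    replica.fetch.min.bytes=1

(The values shown are just the 0.8.2 defaults, noted here for reference.)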


> Jiangjie (Becket) Qin
>
> On 4/14/15, 6:55 AM, "Evan Huus" <evan.huus@shopify.com> wrote:
>
> >Any ideas on this? It's still occurring...
> >
> >Is there a separate mailing list or project for mirrormaker that I could
> >ask?
> >
> >Thanks,
> >Evan
> >
> >On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.huus@shopify.com> wrote:
> >
> >> Hey Folks, we're running into an odd issue with mirrormaker and the
> >> fetch request purgatory on the brokers. Our setup consists of two
> >> six-node clusters (all running 0.8.2.1 on identical hw with the same
> >> config). All "normal" producing and consuming happens on cluster A.
> >> Mirrormaker has been set up to copy all topics (except a tiny
> >> blacklist) from cluster A to cluster B.
> >>
> >> Cluster A is completely healthy at the moment. Cluster B is not, which
> >> is very odd since it is literally handling the exact same traffic.
> >>
> >> The graph for Fetch Request Purgatory Size looks like this:
> >>
> >> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
> >>
> >> Every time the purgatory shrinks, the latency from that causes one or
> >> more nodes to drop their leadership (it quickly recovers). We could
> >> probably alleviate the symptoms by decreasing
> >> `fetch.purgatory.purge.interval.requests` (it is currently at the
> >> default value) but I'd rather try and understand/solve the root cause
> >> here.
> >>
> >> Cluster B is handling no outside fetch requests, and turning mirrormaker
> >> off "fixes" the problem, so clearly (since mirrormaker is producing to
> >> this cluster, not consuming from it) the fetch requests must be coming
> >> from internal replication. However, the same data is being replicated
> >> when it is originally produced in cluster A, and the fetch purgatory
> >> size sits stably at ~10k there. There is nothing unusual in the logs
> >> on either cluster.
> >>
> >> I have read all the wiki pages and jira tickets I can find about the
> >> new purgatory design in 0.8.2, but nothing stands out as applicable.
> >> I'm happy to provide more detailed logs, configuration, etc. if anyone
> >> thinks there might be something important in there. I am completely
> >> baffled as to what could be causing this.
> >>
> >> Any suggestions would be appreciated. I'm starting to think at this
> >> point that we've completely misunderstood or misconfigured *something*.
> >>
> >> Thanks,
> >> Evan
> >>
>
>
