kafka-users mailing list archives

From Jiangjie Qin <j...@linkedin.com.INVALID>
Subject Re: Fetch Request Purgatory and Mirrormaker
Date Wed, 15 Apr 2015 00:31:08 GMT
Hey Evan,

Is this issue only observed when mirror maker is consuming? It looks like
Cluster A also has some other consumers.
Do you mean that if you stop mirror maker, the problem goes away?

Jiangjie (Becket) Qin

On 4/14/15, 6:55 AM, "Evan Huus" <evan.huus@shopify.com> wrote:

>Any ideas on this? It's still occurring...
>
>Is there a separate mailing list or project for mirrormaker that I could
>ask?
>
>Thanks,
>Evan
>
>On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.huus@shopify.com> wrote:
>
>> Hey Folks, we're running into an odd issue with mirrormaker and the fetch
>> request purgatory on the brokers. Our setup consists of two six-node
>> clusters (all running 0.8.2.1 on identical hw with the same config). All
>> "normal" producing and consuming happens on cluster A. Mirrormaker has been
>> set up to copy all topics (except a tiny blacklist) from cluster A to
>> cluster B.
>>
>> Cluster A is completely healthy at the moment. Cluster B is not, which is
>> very odd since it is literally handling the exact same traffic.
>>
>> The graph for Fetch Request Purgatory Size looks like this:
>> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
>>
>> Every time the purgatory shrinks, the latency from that causes one or more
>> nodes to drop their leadership (it quickly recovers). We could probably
>> alleviate the symptoms by decreasing
>> `fetch.purgatory.purge.interval.requests` (it is currently at the default
>> value) but I'd rather try and understand/solve the root cause here.
>>
>> Cluster B is handling no outside fetch requests, and turning mirrormaker
>> off "fixes" the problem, so clearly (since mirrormaker is producing to this
>> cluster, not consuming from it) the fetch requests must be coming from
>> internal replication. However, the same data is being replicated when it is
>> originally produced in cluster A, and the fetch purgatory size sits stably
>> at ~10k there. There is nothing unusual in the logs on either cluster.
>>
>> I have read all the wiki pages and jira tickets I can find about the new
>> purgatory design in 0.8.2 but nothing stands out as applicable. I'm happy
>> to provide more detailed logs, configuration, etc. if anyone thinks there
>> might be something important in there. I am completely baffled as to what
>> could be causing this.
>>
>> Any suggestions would be appreciated. I'm starting to think at this point
>> that we've completely misunderstood or misconfigured *something*.
>>
>> Thanks,
>> Evan
>>
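
For reference, a minimal sketch of the workaround mentioned in the quoted message, assuming it would be set in each broker's server.properties on cluster B. The value shown is purely illustrative (it does not come from the thread and is not a recommendation), and as the original message notes, this would likely only ease the symptom rather than explain the root cause:

```properties
# Broker config (server.properties), Kafka 0.8.2.x.
# Purge interval, in number of requests, of the fetch request purgatory.
# A lower value purges more often, so each purge has less to clear.
# NOTE: 100 is a hypothetical example value, not taken from the thread.
fetch.purgatory.purge.interval.requests=100
```

A change like this typically needs a broker restart to take effect, so it may be worth trying on a single broker first and watching the purgatory size graph before rolling it out.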

