kafka-users mailing list archives

From Kyle Banker <kyleban...@gmail.com>
Subject Re: ReplicaFetcherThread Error, Massive Logging, and Leader Flapping
Date Wed, 10 Jun 2015 16:54:58 GMT
Just for the sake of future forum readers, the solution to the "leader
flapping" problem I described was to increase the ZooKeeper session timeout
setting (zookeeper.session.timeout.ms). I believe we doubled it (15000ms to
30000ms).

For the ReplicaFetcherThread and
https://issues.apache.org/jira/browse/KAFKA-1196 issue(s), there is a
technique for recovery, but it's a bit tedious. Basically, you have to
restart your Kafka broker repeatedly, each time with larger values for
message.max.bytes and replica.fetch.max.bytes. This way, the broker can
catch up without hitting the overflow bug.

I believe this will work as long as your topic contains non-uniformly-sized
messages. In other words, if the messages in your topics are all of the same
large size, then I assume that you will eventually hit a threshold for
message.max.bytes that brings in enough data to trigger the overflow bug.
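
A minimal sketch of how the restart steps might be planned, in Python. It
only prints candidate values for the two settings named above; it does not
talk to a broker. The function names, the doubling factor, and the 10MB to
100MB range (taken from the sizes mentioned in this thread) are illustrative
assumptions. The 32-bit limit is also an assumption about what "overflow"
means here, and the partition count is a rough upper bound, not a
measurement.

```python
# Sketch only: plans a sequence of restart settings; it never contacts Kafka.
MB = 1024 * 1024
INT32_MAX = 2**31 - 1  # assumption: the overflow involves a signed 32-bit size


def fetch_size_schedule(start_bytes, target_bytes, factor=2):
    """Return increasing sizes from start_bytes up to and including target_bytes."""
    if start_bytes <= 0 or factor <= 1:
        raise ValueError("start_bytes must be positive and factor greater than 1")
    sizes = []
    size = start_bytes
    while size < target_bytes:
        sizes.append(size)
        size *= factor
    sizes.append(target_bytes)
    return sizes


def worst_case_response_bytes(lagging_partitions, fetch_bytes):
    """Upper bound if every lagging partition returns a full fetch_bytes chunk."""
    return lagging_partitions * fetch_bytes


# Hypothetical plan: start near 10MB and stop at the largest message size.
schedule = fetch_size_schedule(10 * MB, 100 * MB)
for step in schedule:
    # One restart per step, with both settings raised together.
    print(f"message.max.bytes={step}")
    print(f"replica.fetch.max.bytes={step}")
    risky = worst_case_response_bytes(500, step) > INT32_MAX
    print(f"# worst case over 500 lagging partitions exceeds int32: {risky}")
```

The worst-case figure only matters while many partitions are still far
behind. The point of the small early steps is that partitions with small
messages catch up first, so fewer partitions return full-sized chunks by the
time the larger values are applied.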

On Mon, Apr 20, 2015 at 3:46 PM, Kyle Banker <kylebanker@gmail.com> wrote:

> Hi Jiangjie,
>
> There is nothing of note in the controller log. I've attached that log
> along with the state change log in the following gist:
> https://gist.github.com/banker/78b56a3a5246b25ace4c
>
> This represents a 2-hour period on April 15th.
>
> Since I've disabled the broker in question (on April 15th), there's been
> no change to the state-change logs across the entire cluster. While the
> broker was on, as you can see, state-change.log was growing massively, and
> the broker was exhibiting the "flapping" I've described.
>
> Note that I have auto.leader.rebalance.enable set to true for the entire
> cluster. Are there any known bugs associated with this feature?
>
> Many thanks.
>
> On Thu, Apr 16, 2015 at 2:19 PM, Jiangjie Qin <jqin@linkedin.com.invalid>
> wrote:
>
>> It seems there are many different symptoms you see...
>> Maybe we can start from leader flapping issue. Any findings in controller
>> log?
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On 4/16/15, 12:09 PM, "Kyle Banker" <kylebanker@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >I've run into a pretty serious production issue with Kafka 0.8.2, and I'm
>> >wondering what my options are.
>> >
>> >
>> >ReplicaFetcherThread Error
>> >
>> >I have a broker on a 9-node cluster that went down for a couple of hours.
>> >When it came back up, it started spewing constant errors of the following
>> >form:
>> >
>> >INFO Reconnect due to socket error:
>> >java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
>> >[2015-04-09 22:38:54,580] WARN [ReplicaFetcherThread-0-7], Error in fetch
>> >Name: FetchRequest; Version: 0; CorrelationId: 767; ClientId:
>> >ReplicaFetcherThread-0-7; ReplicaId: 1; MaxWait: 500 ms; MinBytes: 1
>> >bytes;
>> >RequestInfo: [REDACTED] Possible cause: java.io.EOFException: Received -1
>> >when reading from channel, socket has likely been closed.
>> >(kafka.server.ReplicaFetcherThread)
>> >
>> >
>> >Massive Logging
>> >
>> >This produced around 300GB of new logs in a 24-hour period and rendered
>> >the broker completely unresponsive.
>> >
>> >This broker hosts about 500 partitions spanning 40 or so topics (all
>> >topics have a replication factor of 3). One topic contains messages up to
>> >100MB in size. The remaining topics have messages no larger than 10MB.
>> >
>> >It appears that I've hit this bug:
>> >https://issues.apache.org/jira/browse/KAFKA-1196
>> >
>> >
>> >"Leader Flapping"
>> >
>> >I can get the broker to come online without logging massively by reducing
>> >both max.message.bytes and replica.fetch.max.bytes to ~10MB. It then
>> >starts resyncing all but the largest topic.
>> >
>> >Unfortunately, it also starts "leader flapping." That is, it continuously
>> >acquires and relinquishes partition leadership. There is nothing of note
>> >in the logs while this is happening, but the consumer offset checker
>> >clearly shows this. The behavior significantly reduces cluster write
>> >throughput (since producers are constantly failing).
>> >
>> >The only solution I have is to leave the broker off. Is this a known
>> >"catch-22" situation? Is there anything that can be done to fix it?
>> >
>> >Many thanks in advance.
>>
>>
>
