kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guozhang Wang <wangg...@gmail.com>
Subject Re: Sporadic Shrinking/Expanding ISR Blips
Date Mon, 29 Jun 2015 18:08:34 GMT
Hi Christofer,

Does this ISR churn usually come with incoming traffic spikes? If yes I
would suggest you increase both replica.lag.time.max.ms and
replica.lag.max.messages and see if it helps. The SocketTimeoutException is
normal when the followers consumer gets disconnected from the leader.

In the coming 0.8.3 release there is a new featured adde (KAFKA-1546
<https://issues.apache.org/jira/browse/KAFKA-1546>) that handles traffic
spikes better.

Guozhang

On Mon, Jun 29, 2015 at 10:19 AM, Christofer Hedbrandh <
christofer@knewton.com> wrote:

> Hi Kafka users,
>
> Every ~3-4 days we are seeing a broker logging "Shrinking ISR for
> partition..." only to log "Expanding ISR for partition..." a few seconds
> later.
>
> - The broker logging "Shrinking ISR..." is the leader of all partitions for
> which the ISR is shrunk.
> - The number of partitions is always a multiple of the total number of
> partitions it is leading, divided by the number of fetcher threads.
> - There seem to be no particular pattern to which broker kicks which broker
> out of the ISR.
> - After the broker has logged "Shrinking ISR..." for all affected
> partitions, it logs "Closing socket connection to..." for the broker that
> was kicked out, and then "Expanding ISR..." begins.
> - The broker that got kicked out logs "Reconnect due to socket error:
> java.net.SocketTimeoutException" when the leader broker closes the
> connection.
> - Consumers sometimes show latency spikes when this happens.
>
>
> Example leader log:
> ...Shrinking...
> [2015-06-28 05:24:51,141] INFO Partition [my-topic,24] on broker 762855627:
> Shrinking ISR for partition [my-topic,24] from
> 762855627,729301197,1249886642 to 762855627,729301197
> (kafka.cluster.Partition)
> [2015-06-28 05:24:51,149] INFO Partition [my-topic,16] on broker 762855627:
> Shrinking ISR for partition [my-topic,16] from
> 762855627,729301197,1249886642 to 762855627,729301197
> (kafka.cluster.Partition)
> [2015-06-28 05:25:03,295] INFO Closing socket connection to /10.95.197.138
> .
> (kafka.network.Processor)
> [2015-06-28 05:25:03,301] INFO Partition [my-topic,60] on broker 762855627:
> Expanding ISR for partition [my-topic,60] from 762855627,729301197 to
> 762855627,729301197,1249886642 (kafka.cluster.Partition)
> [2015-06-28 05:25:03,339] INFO Partition [my-topic,206] on broker
> 762855627: Expanding ISR for partition [my-topic,206] from
> 762855627,729301197 to 762855627,729301197,1249886642
> (kafka.cluster.Partition)
> ...Expanding...
>
> Example follower log:
> [2015-06-28 05:25:03,299] INFO Reconnect due to socket error:
> java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer)
>
>
> I am tempted to just increase "replica.lag.time.max.ms" from the default
> 10
> seconds that we are currently using, but the SocketTimeoutException is
> suggesting that there may be something else going on here. It also
> surprises me that consumers are affected by this, since no broker partition
> leadership re-election is taking place. Any advice on how to proceed would
> be much appreciated.
>
>
> The setup:
> Three Kafka 0.8.2 brokers running on AWS EC2 c3.2xlarge instances.
> Three ZooKeepers 3.3.6 running on AWS EC2 m1.large instances.
>
> Broker configs:
> auto.leader.rebalance.enable=true
> log.segment.bytes=536870912
> socket.receive.buffer.bytes=1048576
> socket.send.buffer.bytes=1048576
> log.flush.scheduler.interval.ms=3000
> log.flush.interval.ms=3000
> controlled.shutdown.enable=true
> delete.topic.enable=true
> zookeeper.connection.timeout.ms=1000000
> replica.fetch.max.bytes=10485760
> num.replica.fetchers=4
> log.dirs=XXX
> log.retention.minutes=5256000
> message.max.bytes=10485760
> log.flush.interval.messages=10000
> auto.create.topics.enable=false
> port=9092
> zookeeper.connect=XXX
> broker.id=XXX
> log.roll.hours=24
>
> Peak load:
> total messages per second: 80
> total bytes in per second: 110 KiB
> total bytes out per second: 440 KiB (220 of which is replication)
>
>
> Thanks,
> Christofer
>



-- 
-- Guozhang

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message