kafka-users mailing list archives

From: Jacek Szewczyk <jacek7...@gmail.com>
Subject: Re: Kafka CPU spikes every 5 minutes
Date: Thu, 09 Apr 2020 21:14:43 GMT
Never mind, I found the answer. I had an unexpected cron job firing every 5 minutes and blasting
the cluster with connections from 2k+ additional servers.
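
In case it helps anyone else: a quick way to spot a periodic connection storm like that is to
sample the number of established connections on the broker port over time. A rough sketch
(Linux-only, reads /proc/net/tcp; assumes the listener is on 9092):

    import time

    BROKER_PORT = 9092  # adjust if your listener uses a different port

    def established_to_port(port):
        # Count ESTABLISHED TCP connections whose local port matches.
        # /proc/net/tcp stores addresses as HEXIP:HEXPORT; state 01 = ESTABLISHED.
        count = 0
        for path in ("/proc/net/tcp", "/proc/net/tcp6"):
            try:
                with open(path) as f:
                    next(f)  # skip the header line
                    for line in f:
                        fields = line.split()
                        local_port = int(fields[1].split(":")[1], 16)
                        if local_port == port and fields[3] == "01":
                            count += 1
            except FileNotFoundError:
                pass
        return count

    # Print a count every 10 seconds; a spike every 5 minutes shows up immediately.
    while True:
        print(time.strftime("%H:%M:%S"), established_to_port(BROKER_PORT))
        time.sleep(10)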


> On Apr 8, 2020, at 10:46, Jacek Szewczyk <jacek7989@gmail.com> wrote:
> 
> Hi All,
>  
> I am seeing strange behavior with Kafka 2.0.0.3.1.4. My cluster contains 9 brokers + 3
> dedicated zookeepers, and for some unknown reason there is a spike in CPU every 5 minutes
> which causes timeouts between producers, consumers and brokers. Basically, every 5 minutes
> CPU spikes to 90+% and at the same time network utilization drops to almost 0 (it should be
> in the 100 MB/s range).
> Each broker has 64G of memory, heap set to 9G, 8 cores and a 4G uplink. I have
> 500 partitions (replication=2) and around 1000 producers sending data at 1-minute
> intervals. The aggregate input rate is around 600k messages/s.
> Here is my config:
>  
> auto.create.topics.enable=false
> auto.leader.rebalance.enable=true
> compression.type=producer
> controlled.shutdown.enable=true
> controlled.shutdown.max.retries=3
> controlled.shutdown.retry.backoff.ms=5000
> controller.message.queue.size=10
> controller.socket.timeout.ms=30000
> default.replication.factor=2
> delete.topic.enable=true
> leader.imbalance.check.interval.seconds=300
> leader.imbalance.per.broker.percentage=10
> listeners=PLAINTEXT://localhost:9092
> log.cleanup.interval.mins=10
> log.dirs=/diskc/kafka-logs,/diskd/kafka-logs,/diske/kafka-logs,/diskf/kafka-logs,/diskg/kafka-logs,/diskh/kafka-logs,/diskj/kafka-logs,/diskk/kafka-logs
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.bytes=-1
> log.retention.check.interval.ms=600000
> log.retention.hours=24
> log.roll.hours=24
> log.segment.bytes=1073741824
> message.max.bytes=1000000
> min.insync.replicas=1
> num.io.threads=8
> num.network.threads=3000
> num.partitions=100
> num.recovery.threads.per.data.dir=4
> num.replica.fetchers=4
> offset.metadata.max.bytes=4096
> offsets.commit.required.acks=-1
> offsets.commit.timeout.ms=5000
> offsets.load.buffer.size=5242880
> offsets.retention.check.interval.ms=600000
> offsets.retention.minutes=86400000
> offsets.topic.compression.codec=0
> offsets.topic.num.partitions=50
> offsets.topic.replication.factor=3
> offsets.topic.segment.bytes=104857600
> producer.metrics.enable=false
> producer.purgatory.purge.interval.requests=10000
> queued.max.requests=500
> replica.fetch.max.bytes=1048576
> replica.fetch.min.bytes=1
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.lag.max.messages=4000
> replica.lag.time.max.ms=10000
> replica.socket.receive.buffer.bytes=65536
> replica.socket.timeout.ms=30000
> sasl.enabled.mechanisms=GSSAPI
> sasl.mechanism.inter.broker.protocol=GSSAPI
> security.inter.broker.protocol=PLAINTEXT
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
> zookeeper.connection.timeout.ms=25000
> zookeeper.session.timeout.ms=30000
> zookeeper.sync.time.ms=2000
>  
> In the log messages, every spike starts with ISR shrinking and continues with
> timeouts like this:
> INFO [Partition partition-220 broker=1010] Shrinking ISR from 1010,1006 to 1010 (kafka.cluster.Partition)
> And a ton of messages like:
> WARN Attempting to send response via channel for which there is no open connection, connection
> IP:9092-IP:45520-1 (kafka.network.Processor)
> WARN [ReplicaFetcher replicaId=1010, leaderId=1006, fetcherId=0] Error in response for
> fetch request (type=FetchRequest, replicaId=1010, maxWait=500, minBytes=1, maxBytes=10485760,
> fetchData={topic-312=(offset=984223132, logStartOffset=916099079, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED,
> toForget=, metadata=(sessionId=1467131318, epoch=5031)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1006 was disconnected before the response was read
>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
> I’ve also tried topics with no replication, even for __consumer_offsets, but the result was
> the same. The only setting that makes a difference is a lower number of partitions: if I
> change from 500 to 200 it is more stable, but the 5-minute spike still exists.
> I’ve played around with multiple settings and the issue persists no matter what.
>  
> I would be grateful if anyone could comment on the CPU spikes and shed some light on how to
> fix/improve this.
>  
> Thanks,
> Jacek
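
For the record, the 5-minute cadence was also easy to confirm straight from the broker logs by
bucketing the "Shrinking ISR" messages per minute. A minimal sketch (assumes the stock log4j
timestamp at the start of each server.log line; the script name is just an example):

    import re
    import sys
    from collections import Counter

    # Tally "Shrinking ISR" events per minute; run as: python isr_buckets.py server.log
    TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")  # capture date + hh:mm

    hits = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            if "Shrinking ISR" in line:
                m = TS.match(line)
                if m:
                    hits[m.group(1)] += 1

    for minute, n in sorted(hits.items()):
        print(minute, n)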

