You can also look into the producer time metrics and see if the time goes up before the spike of the request rate. If the time does go up, you can look at the breakdown of the time. Any latency due to local I/O will be included in the local time.

Thanks,

Jun

On Tue, Nov 11, 2014 at 10:50 AM, Wes Chow <wes@chartbeat.com> wrote:

We're seeing periodic spikes in req/sec rates across our nodes. Our cluster is 10 nodes, and the topic has a replication factor of 3. We push around 200k messages / sec into Kafka.


The machines are running the most recent version of Kafka and we're connecting via librdkafka. pingstream02-10 are using the CMS garbage collector, but I switched pingstream01 to use G1GC under the theory that maybe these were GC pauses. The graph shows that likely didn't improve the situation.

My next thought is that maybe this is the effect of log rolling. Checking in the logs, I see a lot of this:

[2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO  kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
[2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO  kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
[2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO  kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
[2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO  kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
[2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO  kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
[2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO  kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.

The "pings" topic in question has 512 partitions, so it does this 512 times every so often. We have an effective retention period of a bit less than 30 min, so rolling happens pretty frequently. Still, if I assume worst case that rolling locks up the process for 2ms and there are 512 rolls every few minutes, I'd expect halting to happen for about a second at a time. The graphs seem to indicate much longer dips, but it's hard for me to know if I'm looking at real data or some sort of artifact.

Fwiw, the producers are not reporting any errors, so it does not seem like we're losing data.

I'm new to Kafka. Should I be worried? If so, how should I be debugging this?

Thanks,
Wes