kafka-users mailing list archives

From Steve Miller <st...@idrathernotsay.com>
Subject Re: Debugging high log flush latency on a broker.
Date Tue, 22 Sep 2015 19:34:46 GMT
   There may be more elegant ways to do this, but I'd think that you could just ls all the
directories specified in log.dirs in your server.properties file for Kafka.  You should see
directories for each topicname-partitionnumber there.
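
   If you'd rather script it than eyeball the ls output, something like the rough Python
sketch below would probably do.  It assumes the partition directories are named
topicname-partitionnumber as described above; the function names are just ones I made up
for illustration, and it falls back to log.dir if log.dirs isn't set.

```python
import os
import re

# Partition directories look like "<topic>-<partition number>".
PARTITION_DIR = re.compile(r"^(?P<topic>.+)-(?P<partition>\d+)$")


def parse_log_dirs(properties_text):
    """Return the directories listed under log.dirs (or log.dir) in server.properties text."""
    values = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    raw = values.get("log.dirs") or values.get("log.dir") or ""
    return [d.strip() for d in raw.split(",") if d.strip()]


def partitions_on_broker(log_dirs):
    """Return (topic, partition, log_dir) for every partition directory found."""
    found = []
    for log_dir in log_dirs:
        for name in sorted(os.listdir(log_dir)):
            match = PARTITION_DIR.match(name)
            # Checkpoint files and other non-directories are ignored.
            if match and os.path.isdir(os.path.join(log_dir, name)):
                found.append((match.group("topic"), int(match.group("partition")), log_dir))
    return found
```

   You'd read your server.properties, hand the text to parse_log_dirs(), and pass the result
to partitions_on_broker().  Note this only tells you which replicas live on the broker, not
which of them it's currently the leader for.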

   Offhand it sounds to me like maybe something's evicting pages from the buffer cache from
time to time, causing Kafka to do a lot more I/O all of a sudden than usual.  Why that happens,
I don't know, but that'd be my guess: either something needs more pages for applications all
of a sudden, or like you said, there's some characteristic of the traffic for the partitions
on this broker that isn't the same as it is for all the other brokers.
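
   One way to check the buffer cache guess, assuming this is Linux, would be to sample
/proc/meminfo every few seconds and see whether Cached drops (or Dirty/Writeback spike)
around the time the flush latency goes up.  A minimal sketch of the parsing, with function
names that are purely illustrative:

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Lines look like "Cached:   1234567 kB"; keep only the number.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_snapshot(text):
    """Pick out the fields most relevant to page cache and writeback pressure."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Cached", "Dirty", "Writeback")}
```

   You'd call cache_snapshot(open("/proc/meminfo").read()) in a loop with a timestamp and
line the output up against the broker's flush latency graph.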

   Filesystem type and creation parameters are the same as on the other hosts?  sysctl stuff
all tuned the same way (assuming this is Linux, that is)?
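
   For the sysctl comparison, you could save the output of sysctl -a on this broker and on
one of the healthy ones and diff them.  A small sketch, assuming the usual "key = value"
output format (the function names here are made up):

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by sysctl -a, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_sysctl(text_a, text_b):
    """Return {key: (value_on_a, value_on_b)} for settings that differ or exist on one host only."""
    a, b = parse_sysctl(text_a), parse_sysctl(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

   Some keys will naturally differ between hosts (anything with a hostname or a counter in
it), so the vm.* settings are probably the ones worth looking at first.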

   Any chance there's some sort of network hiccup that makes some follower get a little behind,
and then the act of it trying to catch back up pushes the I/O past what it can sustain steady-state?
 (If something gets significantly behind, depending on the size of your buffer cache relative
to the retention in your topics, you could have something, say, start reading from the first
offset in that topic and partition, which might well require going to disk rather than being
satisfied from the buffer cache.  I could see that slowing I/O enough, if it's on the edge
otherwise, that now you can't keep up with the write rate until that consumer gets caught up.)

   The other idea would be that, I dunno, maybe there's a topic where the segment size is different,
and so when it goes to delete a segment it's spending a lot more time putting blocks from
that file back onto the filesystem free list (or whatever data structure it is these days
(-: ).


On Tue, Sep 22, 2015 at 11:46:49AM -0700, Rajiv Kurian wrote:
> Also any hints on how I can find the exact topic/partitions assigned to
> this broker? I know in ZK we can see the partition -> broker mapping, but I
> am looking for a broker -> partition mapping. I can't be sure if the load
> that is causing this problem is because of leader traffic or follower
> traffic. What is weird is that I rarely if ever see other brokers in the
> cluster have the same problem. With 3 way replication (leader + 2 replicas)
> I'd imagine that the same work load would cause problems on other brokers
> too.
