kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: full disk
Date Mon, 23 Sep 2013 00:10:33 GMT
Paul,

This is likely because the log cleaner only runs every
log.cleanup.interval.mins (default: 10) minutes. We should probably
consider running the cleaner when a broker starts up. Could you file a
JIRA for that?
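
For illustration only, here is a minimal Python sketch of the kind of
utility asked about in the quoted message below: it lists segment files
that are older than a retention window so they can be reviewed before
anything is removed by hand. It is not part of Kafka. The function name
expired_segments is hypothetical, and the sketch assumes one
subdirectory per topic-partition containing *.log segment files, with
age judged by file modification time. It only reports candidates and
deletes nothing. It skips the newest segment in each partition on the
assumption that the broker is still appending to it, and it ignores any
companion index files, which would need the same treatment.

```python
import os
import time


def expired_segments(log_dir, retention_hours, now=None):
    """Return paths of *.log segments older than the retention window.

    Hypothetical helper, not a Kafka API. The newest segment in each
    partition directory is never reported, on the assumption that it is
    the active segment.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    expired = []
    for entry in sorted(os.listdir(log_dir)):
        partition_dir = os.path.join(log_dir, entry)
        if not os.path.isdir(partition_dir):
            continue
        # Segment names are zero-padded offsets, so a lexical sort
        # orders them from oldest to newest.
        segments = sorted(
            f for f in os.listdir(partition_dir) if f.endswith(".log")
        )
        for name in segments[:-1]:  # skip the newest segment
            path = os.path.join(partition_dir, name)
            if os.path.getmtime(path) < cutoff:
                expired.append(path)
    return expired
```

Calling expired_segments with the broker's log directory and 12 would
list the segments that fall outside a 12-hour window. Any cleanup based
on that list should only be done while the broker is stopped.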
Thanks,
Jun


On Sat, Sep 21, 2013 at 12:06 PM, Paul Mackles <pmackles@adobe.com> wrote:

> Hi -
>
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran
> out of disk on one of the nodes. As expected, the broker shut itself down
> and all of the clients switched over to the other nodes. So far so good.
>
> To free up disk space, I reduced log.retention.hours to something more
> manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2
> nodes were running OK, I first tried to restart the node which ran out of
> disk. Unfortunately, it kept shutting itself down due to the full disk.
> From the logs, I think this was because it was trying to sync-up the
> replicas it was responsible for and of course couldn't due to the lack of
> disk space. My hope was that upon restart, it would see the new retention
> settings and free up a bunch of disk space before trying to do any syncs.
>
> I then went and restarted the other 2 nodes. They both picked up the new
> retention settings and freed up a bunch of storage as a result. I then went
> back and tried to restart the 3rd node but to no avail. It still had
> problems with the full disks.
>
> I thought about trying to reassign partitions so that the node in question
> had less to manage but that turned out to be a hassle so I wound up
> manually deleting some of the old log/segment files. The broker seemed to
> come back fine after that but that's not something I would want to do on a
> production server.
>
> We obviously need better monitoring/alerting to avoid this situation
> altogether, but I am wondering if the order of operations at startup
> could/should be changed to better account for scenarios like this. Or maybe
> a utility to remove old logs after changing ttl? Did I miss a better way to
> handle this?
>
> Thanks,
> Paul
>
>
>
>
