kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: full disk
Date Tue, 24 Sep 2013 03:41:24 GMT
Yes, manually removing the old log files is the simplest solution right now.
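For reference, the manual cleanup Jun describes can be sketched roughly as below. This is an illustrative script, not a supported tool: the log directory path is an assumption (it must match `log.dirs` in `server.properties`), the broker must be stopped first, and the newest (active) segment of each partition must never be deleted. It only prints candidates; nothing is removed until you swap `echo` for `rm -f` yourself.

```shell
#!/bin/sh
# Sketch: free disk space on a STOPPED 0.8 broker by listing the oldest
# closed log segments as deletion candidates. Paths are illustrative.

LOG_DIR="${LOG_DIR:-/var/kafka-logs}"   # assumption: matches log.dirs in server.properties

# Kafka names each segment after its base offset, zero-padded, so a lexical
# sort orders segments oldest-first. For each partition directory, keep the
# last (active) segment and print the rest as candidates. Each deleted .log
# segment's matching .index file should be removed as well.
for dir in "$LOG_DIR"/*/; do
  ls -1 "$dir"*.log 2>/dev/null | sort | head -n -1 | while read -r seg; do
    echo "candidate: $seg"          # would also delete "${seg%.log}.index"
  done
done
```

Replacing `echo` with `rm -f` (and removing the matching `.index` files) only after reviewing the printed list keeps the destructive step deliberate.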



On Mon, Sep 23, 2013 at 9:16 AM, Paul Mackles <pmackles@adobe.com> wrote:

> Done:
> https://issues.apache.org/jira/browse/KAFKA-1063
> Out of curiosity, is manually removing the older log files the only
> option at this point?
> From: Paul Mackles <pmackles@adobe.com>
> To: "users@kafka.apache.org" <users@kafka.apache.org>
> Subject: full disk
> Hi -
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran
> out of disk on one of the nodes. As expected, the broker shut itself down
> and all of the clients switched over to the other nodes. So far so good.
> To free up disk space, I reduced log.retention.hours to something more
> manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2
> nodes were running OK, I first tried to restart the node which ran out of
> disk. Unfortunately, it kept shutting itself down due to the full disk.
> From the logs, I think this was because it was trying to sync-up the
> replicas it was responsible for and of course couldn't due to the lack of
> disk space. My hope was that upon restart, it would see the new retention
> settings and free up a bunch of disk space before trying to do any syncs.
> I then went and restarted the other 2 nodes. They both picked up the new
> retention settings and freed up a bunch of storage as a result. I then went
> back and tried to restart the 3rd node but to no avail. It still had
> problems with the full disks.
> I thought about trying to reassign partitions so that the node in question
> had less to manage but that turned out to be a hassle so I wound up
> manually deleting some of the old log/segment files. The broker seemed to
> come back fine after that but that's not something I would want to do on a
> production server.
> We obviously need better monitoring/alerting to avoid this situation
> altogether, but I am wondering if the order of operations at startup
> could/should be changed to better account for scenarios like this. Or maybe
> a utility to remove old logs after changing ttl? Did I miss a better way to
> handle this?
> Thanks,
> Paul
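
The retention change Paul describes amounts to editing the broker config on each node and restarting (in 0.8, retention settings are not applied dynamically). A minimal fragment, with the size-based cap as an illustrative addition that can act as a second line of defense against a full disk:

```properties
# server.properties on each broker; restart required in 0.8
log.retention.hours=12
# optional size-based cap per partition, in bytes; value here is illustrative
log.retention.bytes=1073741824
```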
