kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: corrupt recovery checkpoint file issue....
Date Mon, 10 Nov 2014 02:10:08 GMT
Guozhang,

In OffsetCheckpoint.write(), we don't catch any exceptions. There is only a
finally clause to close the writer. So, if there is any exception during
write or sync, the exception will be propagated back to the caller and
swapping will be skipped.
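
For illustration, the write-then-swap behavior described above can be sketched in Java roughly as follows. This is not the actual OffsetCheckpoint code: the class name, the ".tmp" suffix, and the simplified "entry count, then one key/offset pair per line" format are all assumptions made for the example. The point it tries to show is that with only a finally clause and no catch, a failure in write, flush, or sync leaves the method before the rename is reached.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Kafka's OffsetCheckpoint.
final class CheckpointWriteSketch {

    // Writes the offsets to "<target>.tmp", syncs, then renames over target.
    static void write(Path target, Map<String, Long> offsets) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        FileOutputStream out = new FileOutputStream(tmp.toFile());
        BufferedWriter writer =
            new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        try {
            // Assumed format: entry count first, then "key offset" per line.
            writer.write(Integer.toString(offsets.size()));
            writer.newLine();
            for (Map.Entry<String, Long> e : offsets.entrySet()) {
                writer.write(e.getKey() + " " + e.getValue());
                writer.newLine();
            }
            writer.flush();
            // May throw SyncFailedException; nothing here catches it.
            out.getFD().sync();
        } finally {
            // Only cleanup happens here; exceptions still propagate.
            writer.close();
        }
        // Reached only when write, flush and sync all succeeded.
        // On POSIX file systems this is a rename that replaces the target.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-sketch");
        Path target = dir.resolve("recovery-point-offset-checkpoint");
        Map<String, Long> offsets = new LinkedHashMap<>();
        offsets.put("topic-0", 42L);
        write(target, offsets);
        System.out.println(Files.readAllLines(target));
    }
}
```

Note that a successful sync followed by a rename still depends on the file system and storage honoring the sync, which is why the questions about the storage system in the quoted thread matter.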

Thanks,

Jun

On Fri, Nov 7, 2014 at 9:47 AM, Guozhang Wang <wangguoz@gmail.com> wrote:

> Jun,
>
> Checking the OffsetCheckpoint.write function, if
> "fileOutputStream.getFD.sync" throws an exception it will just be caught and
> forgotten, and the swap will still happen. Maybe we need to catch the
> SyncFailedException and re-throw it as a FATAL error to skip the swap.
>
> Guozhang
>
>
> On Thu, Nov 6, 2014 at 8:50 PM, Jason Rosenberg <jbr@squareup.com> wrote:
>
> > I'm still not sure what caused the reboot of the system (but yes, it appears
> > to have crashed hard).  The file system is xfs, on CentOS Linux.  I'm not
> > yet sure, but I think also before the crash, the system might have become
> > wedged.
> >
> > It appears the corrupt recovery files actually contained all zero bytes,
> > after looking at it with odb.
> >
> > I'll file a Jira.
> >
> > On Thu, Nov 6, 2014 at 7:09 PM, Jun Rao <junrao@gmail.com> wrote:
> >
> > > I am also wondering how the corruption happened. The way that we update the
> > > OffsetCheckpoint file is to first write to a tmp file and flush the data.
> > > We then rename the tmp file to the final file. This is done to prevent
> > > corruption caused by a crash in the middle of the writes. In your case,
> > > did the host crash? What kind of storage system are you using? Is there any
> > > non-volatile cache on the storage system?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <jbr@squareup.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > We recently had a Kafka node go down suddenly. When it came back up, it
> > > > apparently had a corrupt recovery file, and refused to start up:
> > > >
> > > > 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error starting up KafkaServer
> > > > java.lang.NumberFormatException: For input string:
> > > > "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> > > > ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
> > > >         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > > >         at java.lang.Integer.parseInt(Integer.java:481)
> > > >         at java.lang.Integer.parseInt(Integer.java:527)
> > > >         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> > > >         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> > > >         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
> > > >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
> > > >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
> > > >         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> > > >         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> > > >         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
> > > >         at kafka.log.LogManager.<init>(LogManager.scala:57)
> > > >         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
> > > >         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
> > > >
> > > > And since the app is under a monitor (so it was repeatedly restarting and
> > > > failing with this error for several minutes before we got to it)…
> > > >
> > > > We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it
> > > > then restarted cleanly (but of course re-synced all its data from
> > > > replicas, so we had no data loss).
> > > >
> > > > Anyway, I’m wondering if that’s the expected behavior? Or should it not
> > > > declare it corrupt and then proceed automatically to an unclean restart?
> > > >
> > > > Should this NumberFormatException be handled a bit more gracefully?
> > > >
> > > > We saved the corrupt file if it’s worth inspecting (although I doubt it
> > > > will be useful!)….
> > > >
> > > > Jason
> > > >
> > >
> >
>
>
>
> --
> -- Guozhang
>
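
On the question quoted above about handling the NumberFormatException more gracefully: one possible approach, sketched below in Java, is for the reader to treat an unparseable checkpoint as if it were missing, so that the caller can fall back to a full recovery instead of failing startup. This is only an assumption about what "graceful" could look like, not what Kafka does. The class and method names and the simplified "entry count, then one key/offset pair per line" format are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not Kafka's OffsetCheckpoint.
final class CheckpointReadSketch {

    // Returns the parsed offsets, or Optional.empty() when the file is
    // missing or its content cannot be parsed (for example all zero bytes).
    static Optional<Map<String, Long>> readOrEmpty(Path file) throws IOException {
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        String[] lines = text.split("\\R");
        try {
            // Assumed format: entry count first, then "key offset" per line.
            int expected = Integer.parseInt(lines[0].trim());
            Map<String, Long> offsets = new HashMap<>();
            for (int i = 1; i < lines.length; i++) {
                String line = lines[i].trim();
                if (line.isEmpty()) {
                    continue;
                }
                String[] parts = line.split("\\s+");
                if (parts.length != 2) {
                    return Optional.empty();
                }
                offsets.put(parts[0], Long.parseLong(parts[1]));
            }
            // A count mismatch is also treated as corruption.
            return offsets.size() == expected ? Optional.of(offsets) : Optional.empty();
        } catch (NumberFormatException e) {
            // Corrupt content: report "no usable checkpoint" to the caller.
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("checkpoint-sketch", ".txt");
        Files.write(f, new byte[256]); // simulate the all-zero file
        System.out.println(readOrEmpty(f).isPresent());
    }
}
```

Whether silently falling back is desirable is a separate design question, since it hides the corruption unless the caller also logs it.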
