kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: corrupt recovery checkpoint file issue....
Date Fri, 07 Nov 2014 00:09:37 GMT
I am also wondering how the corruption happened. The way that we update the
OffsetCheckpoint file is to first write to a tmp file and flush the data.
We then rename the tmp file to the final file. This is done to prevent
corruption caused by a crash in the middle of the writes. In your case, did
the host crash? What kind of storage system are you using? Is there any
non-volatile cache on the storage system?
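The write-tmp-flush-rename pattern described above can be sketched as follows. This is a hypothetical illustration of the technique, not Kafka's actual OffsetCheckpoint code; the class and method names are made up:

```java
import java.io.*;
import java.nio.file.*;

public class AtomicCheckpoint {
    // Sketch of write-to-tmp, flush, then rename. The rename is atomic on
    // POSIX filesystems, so a reader sees either the complete old file or
    // the complete new file, never a half-written one.
    static void writeCheckpoint(File target, String contents) throws IOException {
        File tmp = new File(target.getAbsolutePath() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(contents.getBytes("UTF-8"));
            out.flush();
            out.getFD().sync();  // force the bytes to stable storage before renaming
        }
        Files.move(tmp.toPath(), target.toPath(),
                   StandardCopyOption.ATOMIC_MOVE,
                   StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("checkpoint", ".txt");
        writeCheckpoint(f, "0\n1\nmy-topic 0 42\n");
        System.out.print(new String(Files.readAllBytes(f.toPath()), "UTF-8"));
    }
}
```

Note that this only protects against a crash mid-write at the filesystem level; a storage layer with a volatile write cache can still lose or corrupt the synced bytes, which is why the questions above matter.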

Thanks,

Jun

On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <jbr@squareup.com> wrote:

> Hi,
>
> We recently had a kafka node go down suddenly. When it came back up, it
> apparently had a corrupt recovery file, and refused to startup:
>
> 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error
> starting up KafkaServer
> java.lang.NumberFormatException: For input string:
>
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>         at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:481)
>         at java.lang.Integer.parseInt(Integer.java:527)
>         at
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>         at
> kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>         at
> kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>         at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at
> scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>         at kafka.log.LogManager.<init>(LogManager.scala:57)
>         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
>
> And since the app is under a monitor (so it was repeatedly restarting and
> failing with this error for several minutes before we got to it)…
>
> We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it
> then restarted cleanly (but of course re-synced all its data from
> replicas, so we had no data loss).
>
> Anyway, I’m wondering if that’s the expected behavior? Or should it instead
> detect the file as corrupt and proceed automatically with an unclean restart?
>
> Should this NumberFormatException be handled a bit more gracefully?
>
> We saved the corrupt file if it’s worth inspecting (although I doubt it
> will be useful!)….
>
> Jason
>
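The more graceful handling Jason asks about might look like the sketch below: if the checkpoint file is unparseable, log it and fall back to an empty offset map (forcing a full recovery from the log segments or replicas) rather than aborting startup. This is a hypothetical illustration, not Kafka's code; the class name, the key format, and the fallback policy are all assumptions:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class LenientCheckpointReader {
    // Parse a checkpoint file of the form:
    //   line 1: version, line 2: entry count, then "topic partition offset" lines.
    // On any parse failure (e.g. the NumberFormatException from the stack trace),
    // treat the file as corrupt and return an empty map instead of throwing.
    static Map<String, Long> read(File file) {
        Map<String, Long> offsets = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            Integer.parseInt(in.readLine().trim());                  // format version
            int expected = Integer.parseInt(in.readLine().trim());   // entry count
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                offsets.put(parts[0] + "-" + parts[1], Long.parseLong(parts[2]));
            }
            if (offsets.size() != expected)
                throw new IOException("entry count mismatch");
        } catch (IOException | RuntimeException e) {
            // Corrupt or truncated checkpoint: recover from scratch.
            System.err.println("Corrupt checkpoint " + file + ", ignoring: " + e);
            offsets.clear();
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("checkpoint", ".txt");
        Files.write(p, "0\n1\nmy-topic 0 42\n".getBytes("UTF-8"));
        System.out.println(read(p.toFile()));
    }
}
```

The trade-off is that silently discarding the file turns a loud failure into a quiet (and potentially slow) full re-sync, so a loud log line at minimum seems warranted either way.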
