kafka-users mailing list archives

From "M. Manna" <manme...@gmail.com>
Subject Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows
Date Sun, 13 May 2018 09:00:48 GMT
Hi Ted,

I highly appreciate the response over the weekend, and thanks for pointing
out the JIRAs.

I don't believe separate processes are responsible, but rather individual
threads within the broker that are still holding the log/index files open via
IO streams. I am trying to walk a single-node setup through the debugger to
find out which thread is locking the file. Apologies, but it's a huge
application, so it might take me some time to get around to it :)
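[Editor's note] To illustrate why those open handles matter: unlike POSIX, Windows refuses to rename a file while another handle is open on it (unless it was opened with FILE_SHARE_DELETE), which is exactly what the broker's rename to `*.deleted` trips over. A minimal sketch, not Kafka code (the file name only mirrors this thread's example), showing the release-then-rename ordering:

```python
import os
import tempfile

# A stand-in for a log segment file on disk.
path = os.path.join(tempfile.mkdtemp(), "00000000000000000000.log")
with open(path, "wb") as f:
    f.write(b"segment data")

# Simulates another thread still holding the segment open via a stream.
reader = open(path, "rb")

# On Windows, os.rename(path, path + ".deleted") at this point would raise
# PermissionError ("being used by another process") because `reader` still
# holds a handle without delete sharing.

reader.close()                        # release the handle first ...
os.rename(path, path + ".deleted")    # ... then the rename succeeds on any OS
```

The broker's async-delete path performs the rename while fetcher/cleaner threads may still hold the old handles, which is harmless on Linux but fatal on Windows.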

Please do update if you find something new.

Regards,

On 12 May 2018 at 22:15, Ted Yu <yuzhihong@gmail.com> wrote:

> There are two outstanding issues: KAFKA-6059 and KAFKA-6200 which bear some
> resemblance.
>
> Can you try to find how the other process uses the file being deleted ?
>
> https://superuser.com/questions/117902/find-out-which-process-is-locking-a-file-or-folder-in-windows
> https://www.computerhope.com/issues/ch000714.htm
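[Editor's note] The links above boil down to using Sysinternals `handle.exe` (or Resource Monitor's "Associated Handles" search) to find the owner of a lock. A sketch from an elevated prompt, using this thread's example path:

```
C:\> handle.exe C:\kafka1\test-0
```

The tool searches all open handles for the given path fragment and reports the owning process and PID; expect it to point back at the broker's own `java.exe`, consistent with the thread-level theory above.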
>
> Cheers
>
> On Sat, May 12, 2018 at 1:42 PM, M. Manna <manmedia@gmail.com> wrote:
>
> > Hello,
> >
> > We are still stuck on this issue: the 2.11-1.1.0 distribution fails to
> > clean up logs on Windows and brings the entire cluster down, broker by
> > broker. Extending the retention hours and sizes doesn't help; it only
> > burdens the hard drive.
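[Editor's note] The retention settings referred to are the standard broker properties below; the values are illustrative only, and tuning them merely delays the failing segment deletion rather than preventing it:

```properties
# server.properties - illustrative values only
log.retention.hours=168
log.retention.bytes=1073741824
log.retention.check.interval.ms=300000
```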
> >
> > Here is the log
> >
> > > [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1] Rolled new log segment at offset 45 in 105 ms. (kafka.log.Log)
> > > [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1] Scheduling log segment [baseOffset 0, size 2290] for deletion. (kafka.log.Log)
> > > [2018-05-12 21:36:57,673] ERROR Error while deleting segments for test-0 in dir C:\kafka1 (kafka.server.LogDirFailureChannel)
> > > java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
> > >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> > >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> > >         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> > >         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> > >         at java.nio.file.Files.move(Files.java:1395)
> > >         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
> > >         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
> > >         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
> > >         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
> > >         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> > >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> > >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> > >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> > >         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> > >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
> > >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
> > >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> > >         at scala.collection.immutable.List.foreach(List.scala:392)
> > >         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> > >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> > >         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
> > >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> > >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >         at java.lang.Thread.run(Thread.java:748)
> > >         Suppressed: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
> > >                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> > >                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> > >                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> > >                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> > >                 at java.nio.file.Files.move(Files.java:1395)
> > >                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
> > >                 ... 32 more
> > > [2018-05-12 21:36:57,689] INFO [ReplicaManager broker=1] Stopping serving replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
> > > [2018-05-12 21:36:57,689] ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler)
> > > org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for test-0 in dir C:\kafka1
> > > Caused by: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
> > >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> > >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> > >         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> > >         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> > >         at java.nio.file.Files.move(Files.java:1395)
> > >         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
> > >         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
> > >         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
> > >         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
> > >         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> > >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> > >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> > >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> > >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> > >         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> > >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
> > >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
> > >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> > >         at scala.collection.immutable.List.foreach(List.scala:392)
> > >         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> > >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> > >         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
> > >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> > >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >         at java.lang.Thread.run(Thread.java:748)
> > >         Suppressed: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
> > >                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> > >                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> > >                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> > >                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> > >                 at java.nio.file.Files.move(Files.java:1395)
> > >                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
> > >                 ... 32 more
> > > [2018-05-12 21:36:57,689] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 (kafka.server.ReplicaFetcherManager)
> > > [2018-05-12 21:36:57,689] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 (kafka.server.ReplicaAlterLogDirsManager)
> > > [2018-05-12 21:36:57,751] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 and stopped moving logs for partitions  because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
> > > [2018-05-12 21:36:57,751] INFO Stopping serving logs in dir C:\kafka1 (kafka.log.LogManager)
> > > [2018-05-12 21:36:57,767] ERROR Shutdown broker because all log dirs in C:\kafka1 have failed (kafka.log.LogManager)
> >
> >
> > Could someone please share some ideas on how to rectify this on Windows?
> > If this will never be supported on Windows, could we get some official
> > communication, perhaps?
> >
> > Regards,
> >
>
