storm-user mailing list archives

From: 王 纯超 <wangchunc...@outlook.com>
Subject: Worker Restart without Error
Date: Tue, 27 Mar 2018 02:23:33 GMT
Hi all,

I am experiencing a problem where a worker restarts periodically. I checked worker.log and
found the information below around the time the worker shuts down (I do not know exactly
whether it was logged before or after the shutdown; my guess is before, and that it is the
cause of the shutdown):


2018-03-27 09:56:29.331 o.a.s.k.KafkaUtils Thread-63-reader-Tsignalmsg_4-executor[83 83] [INFO]
Task [5/10] assigned [Partition{host=132.98.23.31:9092, topic=Tsignalmsg_4, partition=4}]
2018-03-27 09:56:29.331 o.a.s.k.ZkCoordinator Thread-63-reader-Tsignalmsg_4-executor[83 83]
[INFO] Task [5/10] Deleted partition managers: []
2018-03-27 09:56:29.331 o.a.s.k.ZkCoordinator Thread-63-reader-Tsignalmsg_4-executor[83 83]
[INFO] Task [5/10] New partition managers: []
2018-03-27 09:56:29.331 o.a.s.k.ZkCoordinator Thread-63-reader-Tsignalmsg_4-executor[83 83]
[INFO] Task [5/10] Finished refreshing
2018-03-27 09:56:32.666 o.e.t.n.Netty4Utils elasticsearch[_client_][transport_client_boss][T#1]
[ERROR] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
        at org.elasticsearch.transport.netty4.Netty4Transport.lambda$sendMessage$4(Netty4Transport.java:402)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:905)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:805)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1291)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
        at io.netty.handler.logging.LoggingHandler.write(LoggingHandler.java:249)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.write(Netty4MessageChannelHandler.java:56)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1089)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1136)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1078)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
2018-03-27 09:56:32.675 STDIO Thread-80 [ERROR] Halting due to Out Of Memory Error...Thread-80
2018-03-27 09:56:38.560 STDERR Thread-1 [INFO] Java HotSpot(TM) 64-Bit Server VM warning:
Using the ParNew young collector with the Serial old collector is deprecated and will likely
be removed in a future release
2018-03-27 09:56:40.361 o.a.s.d.worker main [INFO] Launching worker for ESTopology-1-1522058870
on a1cd10fc-dcbd-4e56-a18b-41b6e89985ac:6704
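
My suspicion, which I have not confirmed, is that the OutOfMemoryError is in direct (off-heap)
memory rather than in the heap: the worker embeds an Elasticsearch transport client on Netty,
Netty allocates direct ByteBuffers for network I/O, and -XX:MaxDirectMemorySize is only 128m
here. Below is a rough sketch of how I could watch direct-buffer usage from inside the worker
(the class name DirectMemoryLogger and the 10-second interval are just for illustration):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative only: periodically logs how much direct and mapped buffer memory
// the JVM has allocated, so a climb toward the 128m MaxDirectMemorySize cap
// would be visible in worker.log before the crash.
public class DirectMemoryLogger implements Runnable {
    @Override
    public void run() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            // pool names are typically "direct" and "mapped"
            System.out.printf("buffer pool %s: used=%d bytes, capacity=%d bytes, count=%d%n",
                    pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity(), pool.getCount());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DirectMemoryLogger logger = new DirectMemoryLogger();
        while (true) {
            logger.run();
            Thread.sleep(10_000L); // sample every 10 seconds
        }
    }
}

In the topology itself this would run on a daemon thread started from a bolt's prepare()
method rather than from a main() loop.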

Below is the GC log:

2018-03-27T09:55:36.823+0800: 462.021: [GC (Allocation Failure) 2018-03-27T09:55:36.823+0800:
462.022: [ParNew: 3774912K->317486K(3774912K), 0.1497501 secs] 4704499K->1666793K(10066368K),
0.1503144 secs] [Times: user=1.22 sys=0.65, real=0.15 secs]
2018-03-27T09:56:29.597+0800: 514.795: [GC (Allocation Failure) 2018-03-27T09:56:29.597+0800:
514.796: [ParNew: 3673006K->315626K(3774912K), 0.1109489 secs] 5022313K->1664933K(10066368K),
0.1114803 secs] [Times: user=1.37 sys=0.00, real=0.11 secs]
Heap
 par new generation   total 3774912K, used 2143135K [0x0000000540000000, 0x0000000640000000,
0x0000000640000000)
  eden space 3355520K,  54% used [0x0000000540000000, 0x00000005af8ad4c0, 0x000000060cce0000)
  from space 419392K,  75% used [0x0000000626670000, 0x0000000639aaa870, 0x0000000640000000)
  to   space 419392K,   0% used [0x000000060cce0000, 0x000000060cce0000, 0x0000000626670000)
 tenured generation   total 6291456K, used 1349307K [0x0000000640000000, 0x00000007c0000000,
0x00000007c0000000)
   the space 6291456K,  21% used [0x0000000640000000, 0x00000006925aed18, 0x00000006925aee00,
0x00000007c0000000)
 Metaspace       used 67106K, capacity 580218K, committed 580352K, reserved 2097152K
  class space    used 10503K, capacity 10993K, committed 11008K, reserved 1048576K
tail: gc.log.0.current: file truncated
Java HotSpot(TM) 64-Bit Server VM (25.141-b15) for linux-amd64 JRE (1.8.0_141-b15), built
on Jul 12 2017 04:21:34 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 264403580k(155999164k free), swap 67108856k(67066856k free)
CommandLine flags: -XX:+DisableExplicitGC -XX:GCLogFileSize=1048576 -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=artifacts/heapdump -XX:InitialBootClassLoaderMetaspaceSize=536870912 -XX:InitialHeapSize=10737418240
-XX:+ManagementServer -XX:MaxDirectMemorySize=134217728 -XX:MaxHeapSize=10737418240 -XX:MaxMetaspaceFreeRatio=80
-XX:MaxNewSize=4294967296 -XX:MetaspaceSize=1073741824 -XX:NewSize=4294967296 -XX:NumberOfGCLogFiles=10
-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers
-XX:+UseCompressedOops -XX:+UseGCLogFileRotation -XX:+UseParNewGC
2018-03-27T09:56:41.561+0800: 3.529: [GC (Allocation Failure) 2018-03-27T09:56:41.561+0800:
3.529: [ParNew: 3355520K->34156K(3774912K), 0.0419198 secs] 3355520K->34156K(10066368K),
0.0421130 secs] [Times: user=0.33 sys=0.03, real=0.04 secs]
2018-03-27T09:56:42.847+0800: 4.815: [GC (Allocation Failure) 2018-03-27T09:56:42.848+0800:
4.816: [ParNew: 3389676K->37180K(3774912K), 0.0218977 secs] 3389676K->37180K(10066368K),
0.0222756 secs] [Times: user=0.26 sys=0.03, real=0.02 secs]
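
If I am reading the GC log correctly, the heap was far from full right before the crash.
Taking the 09:56:29 collection, the last one before the error:

    young generation after GC:  315626K of  3774912K  (about 8% occupied)
    whole heap after GC:       1664933K of 10066368K  (about 17% of the 10g heap)

So the heap itself does not look like the resource that ran out, which is why I suspect
something off-heap.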

The last moments of the worker are shown in an inline screenshot (attachment not reproduced
in the archive).


The JVM arguments:
-Dlogging.sensitivity=S3
-Dlogfile.name=worker.log
-Dstorm.home=/opt/apache-storm-1.1.0
-Dworkers.artifacts=/home/hadoop_stq_A/srv/storm/logs/workers-artifacts
-Dstorm.id=ESTopology-1-1522058870
-Dworker.id=f7c73b57-2fc7-4453-ad86-d0b491c4ffd7
-Dworker.port=6704
-Dstorm.log.dir=/home/hadoop_stq_A/srv/storm/logs
-Dlog4j.configurationFile=/opt/apache-storm-1.1.0/log4j2/worker.xml
-DLog4jContextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector
-Dstorm.local.dir=/home/hadoop_stq_A/srv/storm/storm-local-data/
-XX:+DisableExplicitGC
-Xmx10g
-Xms10g
-XX:MetaspaceSize=1g
-XX:InitialBootClassLoaderMetaspaceSize=512m
-XX:MaxMetaspaceFreeRatio=80
-XX:MaxDirectMemorySize=128m
-XX:NewSize=4g
-XX:MaxNewSize=4g
-XX:+PrintGCDetails
-Xloggc:artifacts/gc.log
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+UseParNewGC
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=1M
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=artifacts/heapdump
-Dcom.sun.management.jmxremote
-Dcom.sun.managemen
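
If the direct-memory theory holds, one change I am considering (only a sketch; I am assuming
these options come from worker.childopts / topology.worker.childopts in storm.yaml, and the
512m value is a guess) is to raise the direct-memory cap and enable native memory tracking:

    worker.childopts: "... -XX:MaxDirectMemorySize=512m -XX:NativeMemoryTracking=summary ..."

With native memory tracking enabled, jcmd <worker pid> VM.native_memory summary should show
where the off-heap memory is going.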

So my questions are:
1. How can an OOM occur when the worker heap has so much free space?
2. Does the OOM lead to a worker restart or to a topology failure?

In addition, any information that helps identify the root cause and solve the problem would
be much appreciated.

________________________________
wangchunchao@outlook.com