storm-user mailing list archives

From Jungtaek Lim <kabh...@gmail.com>
Subject Re: ShellBolt raise subprocess heartbeat timeout Exception
Date Mon, 24 Oct 2016 09:14:06 GMT
Sorry, but you seem to be running multiple ShellBolts in one worker, so I can't
tell which thread printed which log line.
Could you add %t to the worker log format? You can just copy the content from
the link below into your worker.xml:
https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
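If you'd rather just add it to your current worker.xml instead of replacing the file,
the relevant piece is the PatternLayout pattern; from memory it looks something like
    %d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n
where %t is the log4j2 converter that prints the thread name.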

And an actual heartbeat timeout did occur:
2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time :
1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000

1477297083673 - 1477296962746 = 120927 > 120000

Thanks,
Jungtaek Lim (HeartSaVioR)


On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:

> I found something strange. Looking at a worker.log, there are three subprocess
> heartbeats (at the 673 ms, 678 ms, and 850 ms marks). Everything is OK until 16:18:03.
>
>
> ...
> ...
> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673,
last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678,
last heartbeat : 1477297079165, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850,
last heartbeat : 1477297079644, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673,
last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678,
last heartbeat : 1477297080165, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850,
last heartbeat : 1477297080814, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673,
last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678,
last heartbeat : 1477297081165, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850,
last heartbeat : 1477297081816, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673,
last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678,
last heartbeat : 1477297082166, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850,
last heartbeat : 1477297082843, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to
subprocess
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673,
last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678,
last heartbeat : 1477297083166, worker timeout (ms) : 120000
>
>
> And then the heartbeat timed out at 16:18:03.675, just before the third
> heartbeat (850 ms) was sent out. However, the timeout is set to 120000 ms, or 120 s.
> How could that happen?
>
> ...
> ...
> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243,
name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog
exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439)
~[storm-core-1.0.1.jar:1.0.1]
>         ... 6 more
> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
~[?:1.7.0_79]
>         ... 1 more
> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
~[?:1.7.0_79]
>         ... 1 more
> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243,
name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847)
~[storm-core-1.0.1.jar:1.0.1]
>
>
>
> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>
> I enabled DEBUG-level logging and can see the BOLT heartbeat information; the
> timeout is 30000 ms and everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>
> I will try to do this and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
> Could you change your log level to DEBUG and check the worker's log? If you use
> Storm 1.x you can change the log level from the UI on the fly.
> ShellBolt logs the subprocess heartbeat activity, but at DEBUG level since it
> could produce lots of log output.
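> (If the UI isn't handy, I believe the dynamic log level CLI in 1.x works too, something
> like: storm set_log_level <your-topology> -l org.apache.storm.task.ShellBolt=DEBUG:600
> where the topology name and the 600-second reset timeout are just placeholders.)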
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> These two lines are logged once per second. Please check that the logs exist,
> that 'last heartbeat' is updated properly, and that the worker timeout is set
> properly.
>
> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>
> I do not set "topology.subprocess.timeout.secs", so
> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set to 30 for my cluster.
> 30 seconds is a very big value; it will never take more than 30
> seconds to process my tuple.
> I think there must be a problem somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
> There are many situations in which ShellBolt can trigger the heartbeat issue, and
> at least STORM-1946 is not the case here.
>
> How long does your tuple take to be processed? You need to set the subprocess
> timeout ("topology.subprocess.timeout.secs") higher than the maximum time it
> takes to process a tuple. You can even set it to a fairly big value so that the
> subprocess heartbeat issue will not happen.
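> For example, set topology.subprocess.timeout.secs to something like 600 in the conf
> you submit with your topology, or (I think this works, but please double-check)
> override it from the CLI at submit time:
>
> storm jar your-topology.jar com.example.YourTopology -c topology.subprocess.timeout.secs=600
>
> The jar name and class above are just placeholders.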
>
>
> ShellBolt requires that each tuple is handled and acked within the heartbeat
> timeout. I struggled to change this behavior so that the subprocess periodically
> sends heartbeats on its own, but had no luck because of the GIL (global interpreter
> lock; the same applies to Ruby). We need to choose one: keep this restriction, or
> disable the subprocess heartbeat.
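> To make the restriction concrete, here is a minimal sketch of a Python multilang bolt
> (the bolt name and field handling are placeholders). The subprocess only answers the
> heartbeat between tuples, so process() must return, and the tuple be acked, within the
> heartbeat timeout:
>
> import storm
>
> class NormalizeBolt(storm.BasicBolt):
>     def process(self, tup):
>         # Must finish within topology.subprocess.timeout.secs;
>         # BasicBolt acks the tuple automatically when process() returns.
>         line = tup.values[0]
>         storm.emit([line.strip().lower()])
>
> NormalizeBolt().run()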
>
> I hope that we can resolve this issue cleanly, but I guess a multi-threaded
> approach doesn't work in Python, Ruby, or any other language that uses a GIL,
> and I have no idea about alternatives.
>
> - Jungtaek Lim (HeartSaVioR).
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>
> I filed an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can anyone
> help?
>
> I've got a simple topology running on Storm 1.0.1. The topology consists
> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
> the following exception.
>
> java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> More information:
> 1. The topology runs with ACK mode.
> 2. The topology has 40 workers.
> 3. The topology emits about 10 million tuples every 10 minutes.
>
> Every time the subprocess heartbeat times out, workers restart and the Python
> processes exit with exitCode:-1, which affects the processing capacity and
> stability of the topology.
>
> I've checked some related issues in the Storm Jira. I first found that STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said the bug had been fixed in Storm 1.0.2. However, I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues. Let's look at the history of this problem.
> DashengJu first reported this problem with non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach to
> this problem with ACK mode, and it seemed that the bug had been fixed in
> 0.10.0. I don't know whether that patch is included in the storm-1.x branch.
> In short, this problem still exists in the latest stable version.
>
>
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
