storm-user mailing list archives

From Jungtaek Lim <kabh...@gmail.com>
Subject Re: ShellBolt raise subprocess heartbeat timeout Exception
Date Tue, 25 Oct 2016 13:35:59 GMT
First of all, ShellBolt talks to the subprocess over stdin/stdout, which are not
thread-safe, so everything is written with this restriction in mind.

setHeartbeat() is called whenever the subprocess sends any message, which shows
that the subprocess is still working. That's why the subprocess must send some
message (whether it responds to the heartbeat ping, emits, or syncs) within the
heartbeat timeout.
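
To make that concrete, here is a rough sketch of the pattern (simplified names,
not the exact ShellBolt source): the reader thread bumps a timestamp whenever
anything arrives from the subprocess, and a separate timer task compares that
timestamp against the timeout.

    import java.util.concurrent.atomic.AtomicLong;

    class HeartbeatSketch {
        // Updated by the reader thread, checked by the heartbeat timer task.
        private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
        private final long workerTimeoutMs = 120_000L; // subprocess heartbeat timeout in ms

        // Called for every message read from the subprocess
        // (heartbeat response, emit, sync, ...), as described above.
        void setHeartbeat() {
            lastHeartbeat.set(System.currentTimeMillis());
        }

        // Run roughly once per second by a scheduled heartbeat timer task.
        void checkHeartbeat() {
            long now = System.currentTimeMillis();
            if (now - lastHeartbeat.get() > workerTimeoutMs) {
                // The real ShellBolt reports this error and kills the subprocess here.
                throw new RuntimeException("subprocess heartbeat timeout");
            }
        }
    }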

- Jungtaek Lim (HeartSaVioR)
On Tue, 25 Oct 2016 at 8:31 PM Zhechao Ma <mazhechaomaillist@gmail.com>
wrote:

> Besides, I have another question.
>
> In ShellBolt.java, the method setHeartbeat() is called in the run() method of
> class BoltReaderRunnable, but why is it not called in the same place in class
> BoltWriterRunnable? In other words, when should setHeartbeat() be called?
>
> 2016-10-24 18:32 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>
> Really, there is a timeout! The last heartbeat doesn't update properly. I'll try
> to adjust the worker log format to determine which thread it is.
> Thanks.
>
> 2016-10-24 17:14 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
> Sorry, but you seem to run multiple ShellBolts in one worker, so I can't
> see which thread prints which log.
> Could you add %t to the worker log format? You can just copy the content from
> the link below into your worker.xml.
> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
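>
> For reference, a minimal sketch of what adding %t to the layout pattern could
> look like (your worker.xml may differ; the linked file has the full configuration):
>
>     <PatternLayout>
>         <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n</pattern>
>     </PatternLayout>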
>
> And an actual heartbeat timeout occurred:
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>
> 1477297083673 - 1477296962746 = 120927 > 120000
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>
> I found something strange. Looking at a worker.log, there are three subprocess
> heartbeats, at 673 ms, 678 ms, and 850 ms. Everything is OK until 16:18:03.
>
>
> ...
> ...
> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>
>
> And then, a heartbeat timeout occurred at 16:18:03.675, before the third
> heartbeat (850 ms) was sent out. However, the timeout is set to 120000 ms, or 120 s.
> How could that happen?
>
> ...
> ...
> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>         ... 6 more
> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>         ... 1 more
> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>
>
>
> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>
> I turned on the DEBUG-level log and can see the BOLT heartbeat information; the
> timeout is 30000 ms, and everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>
> I will try to do this and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
> Could you change your log level to DEBUG and check the worker's log? If you use
> Storm 1.x you can change the log level from the UI on the fly.
> ShellBolt logs information about the subprocess heartbeat, but at DEBUG level
> since it could produce lots of logs.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> These two lines are logged about every second. Please check that the logs exist,
> that 'last heartbeat' is updated properly, and also that the worker timeout is set
> properly.
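>
> (If changing the level from the UI isn't convenient, a static override in
> worker.xml along these lines should also work; this is a sketch assuming
> log4j2 syntax, not copied from an actual file:
>
>     <Logger name="org.apache.storm.task.ShellBolt" level="DEBUG"/>
>
> placed inside the <Loggers> section.)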
>
> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>
> I did not set "topology.subprocess.timeout.secs", so
> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set to 30 for my cluster.
> 30 seconds is a very big value; it will never take more than 30 seconds to
> process my tuple.
> I think there must be a problem somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
> There are many situations where ShellBolt can hit a heartbeat issue, and at
> least STORM-1946 is not the case here.
>
> How long does your tuple take to be processed? You need to set the subprocess
> timeout ("topology.subprocess.timeout.secs") higher than the maximum processing
> time. You can even set it to a fairly big value so that the subprocess
> heartbeat issue will not happen.
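>
> For example, a rough sketch of raising it when building the topology (the key
> is the one above; the value and variable names are just illustrative):
>
>     Config conf = new Config();  // org.apache.storm.Config
>     // Give the subprocess up to 10 minutes between messages before ShellBolt
>     // treats it as dead; without this, supervisor.worker.timeout.secs is used.
>     conf.put("topology.subprocess.timeout.secs", 600);
>     // ... then submit the topology with this conf as usual.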
>
>
> ShellBolt requires that each tuple is handled and acked within the heartbeat
> timeout. I tried to change this behavior so that the subprocess periodically
> sends heartbeats on its own, but had no luck because of the GIL - global
> interpreter lock (same for Ruby). We need to choose one: keep this restriction,
> or disable the subprocess heartbeat.
>
> I hope we can resolve this issue cleanly, but I guess a multi-threaded
> approach doesn't work for Python, Ruby, or any other language with a GIL, and
> I have no idea about alternatives.
>
> - Jungtaek Lim (HeartSaVioR).
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>
> I filed an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can anyone
> help?
>
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
> the following exception.
>
> java.lang.RuntimeException: subprocess heartbeat timeout
>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> More information:
> 1. The topology runs with ACK mode.
> 2. The topology has 40 workers.
> 3. The topology emits about 10 million tuples every 10 minutes.
>
> Every time the subprocess heartbeat times out, workers restart and the Python
> processes exit with exitCode:-1, which affects the processing capacity and
> stability of the topology.
>
> I've checked some related issues in the Storm Jira. I first found that STORM-1946
> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
> to this problem and said the bug had been fixed in Storm 1.0.2. However, I got
> the same exception even after I upgraded Storm to 1.0.2.
>
> I checked other related issues as well. Let's look at the history of this problem.
> DashengJu first reported this problem with non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach to
> this problem with ACK mode, and it seemed that the bug had been fixed in
> 0.10.0. I don't know whether that patch is included in the storm-1.x branch. In
> short, this problem still exists in the latest stable version.
>
>
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
>
>
>
> --
> Thanks
> Zhechao Ma
>
