storm-user mailing list archives

From Zhechao Ma <mazhechaomaill...@gmail.com>
Subject Re: ShellBolt raise subprocess heartbeat timeout Exception
Date Tue, 25 Oct 2016 11:31:02 GMT
Besides, I have another question.

In ShellBolt.java, the method setHeartbeat() is called in the run() method of the
inner class BoltReaderRunnable, but why is it not also called in the corresponding
place in BoltWriterRunnable? In other words, when should setHeartbeat() be called?
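
For reference, what I see in the 1.0.x source looks roughly like this (paraphrased
from memory, not the literal code; handleCommand() here just stands in for the
ack/fail/emit/log handling):

    // reader thread: refreshes the heartbeat whenever the subprocess writes back
    private class BoltReaderRunnable implements Runnable {
        public void run() {
            while (_running) {
                try {
                    ShellMsg shellMsg = _process.readShellMsg(); // blocks on subprocess output
                    setHeartbeat();                              // only the reader updates it
                    handleCommand(shellMsg);                     // ack / fail / emit / log / ...
                } catch (Throwable t) {
                    die(t);
                }
            }
        }
    }

My guess is that only the reader updates the heartbeat because a reply from the
subprocess is the only real evidence that it is still alive and making progress,
while the writer could keep pushing data into a dead pipe. Is that the intended reason?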

2016-10-24 18:32 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:

> Really, there is a timeout! The last heartbeat doesn't update properly. I'll try
> to adjust the worker log format to determine which thread it is.
> Thanks.
>
> 2016-10-24 17:14 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>
>> Sorry, but you seem to run multiple ShellBolts in one worker, so I can't
>> tell which thread prints which log line.
>> Could you add %t to the worker log format? You can just copy the content from
>> the link below into your worker.xml.
>> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
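>>
>> For example (illustrative only - please copy the real file from the link above),
>> the appender pattern would end up with %t in it, along the lines of:
>>
>>   %d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n
>>
>> so every line also shows which thread wrote it.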
>>
>> And there is an actual heartbeat timeout:
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>
>> 1477297083673 - 1477296962746 = 120927 > 120000
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>>
>>> I found something strange. Looking at a worker.log, there are three subprocess
>>> heartbeats, at 673ms, 678ms, and 850ms. Everything is OK until 16:18:03.
>>>
>>>
>>> ...
>>> ...
>>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>>
>>>
>>> And then, the heartbeat times out at 16:18:03.675, just before the third
>>> heartbeat (850ms) is sent out. However, the timeout is set to 120000 ms, or 120s.
>>> How could that happen?
>>>
>>> ...
>>> ...
>>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>>         at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>>         ... 6 more
>>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>>         ... 1 more
>>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>>         at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>>         ... 1 more
>>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>>         at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>>
>>>
>>>
>>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>>>
>>> I turned on the DEBUG-level log and can see the BOLT heartbeat information; the
>>> timeout is 30000ms, and everything looks OK.
>>>
>>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <mazhechaomaillist@gmail.com>:
>>>
>>> I will try to do this and reply later. Thanks.
>>>
>>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>>>
>>> Could you change your log level to DEBUG and look at the worker's log? If you
>>> use Storm 1.x you can change the log level from the UI on the fly.
>>> ShellBolt writes log lines regarding the subprocess heartbeat, but their level is
>>> DEBUG since they could produce a lot of output.
>>>
>>> Two lines:
>>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>>> - BOLT - sending heartbeat request to subprocess
>>>
>>> These two lines will be logged every second. Please check that the logs
>>> exist, that 'last heartbeat' is updated properly, and that the worker timeout
>>> is set properly.
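>>>
>>> For context, the check behind the first line is essentially this (a simplified
>>> sketch of ShellBolt$BoltHeartbeatTimerTask with abbreviated names, not the
>>> literal source):
>>>
>>>     long current = System.currentTimeMillis();
>>>     long lastHeartbeat = getLastHeartbeat();
>>>     // the "current time / last heartbeat / worker timeout" DEBUG line is logged here
>>>     if (current - lastHeartbeat > workerTimeoutMs) {
>>>         bolt.die(new RuntimeException("subprocess heartbeat timeout"));
>>>     }
>>>     // otherwise a heartbeat request is queued for the subprocess
>>>     // (the "sending heartbeat request to subprocess" DEBUG line)
>>>
>>> So if 'last heartbeat' stops advancing, the task will kill the subprocess once the
>>> timeout elapses.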
>>>
>>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <mazhechaomaillist@gmail.com> wrote:
>>>
>>> I do not set "topology.subprocess.timeout.secs", so
>>> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>> which is set to 30 for my cluster.
>>> 30 seconds is a very, very big value; it will never take more than
>>> 30 seconds to process my tuple.
>>> I think there must be a problem somewhere else.
>>>
>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <kabhwan@gmail.com>:
>>>
>>> There are many situations in which ShellBolt can trigger the heartbeat issue, and
>>> at least STORM-1946 is not the case here.
>>>
>>> How long does your tuple take to be processed? You need to set the
>>> subprocess timeout seconds ("topology.subprocess.timeout.secs")
>>> higher than the maximum processing time. You can even set it to a fairly big
>>> value so that the subprocess heartbeat issue will not happen.
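>>>
>>> As a sketch only (the key name is the one above; "my-topology", builder, and the
>>> value 600 are just placeholders - pick a value above your worst-case tuple time):
>>>
>>>     import org.apache.storm.Config;
>>>     import org.apache.storm.StormSubmitter;
>>>
>>>     // Raise the ShellBolt subprocess heartbeat timeout for this topology.
>>>     Config conf = new Config();
>>>     conf.put("topology.subprocess.timeout.secs", 600);
>>>     StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
>>>
>>> I believe the same key can also go into storm.yaml if you want a cluster-wide default.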
>>>
>>>
>>> ShellBolt requires that each tuple is handled and acked within the heartbeat
>>> timeout. I struggled to change this behavior so that the subprocess periodically
>>> sends heartbeats on its own, but had no luck because of the GIL - global interpreter
>>> lock (same for Ruby). We need to choose one: stick with this restriction, or disable
>>> the subprocess heartbeat.
>>>
>>> I hope that we can resolve this issue cleanly, but I guess a multi-threaded
>>> approach doesn't work in Python, Ruby, or any other language that uses a GIL, and
>>> I have no idea about alternatives.
>>>
>>> - Jungtaek Lim (HeartSaVioR).
>>>
>>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <mazhechaomaillist@gmail.com>
>>> wrote:
>>>
>>> I filed an issue (STORM-2150
>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can anyone
>>> help?
>>>
>>> I've got a simple topology running with Storm 1.0.1. The topology consists
>>> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
>>> the following exceptions.
>>>
>>> java.lang.RuntimeException: subprocess heartbeat timeout
>>>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> More information here:
>>> 1. The topology runs with ACK mode.
>>> 2. The topology has 40 workers.
>>> 3. The topology emits about 10 million tuples every 10 minutes.
>>>
>>> Every time the subprocess heartbeat times out, workers restart and the Python
>>> processes exit with exitCode:-1, which affects the processing capacity and
>>> stability of the topology.
>>>
>>> I've checked some related issues in the Storm Jira. I first found that STORM-1946
>>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>>> to this problem and said the bug had been fixed in Storm 1.0.2. However, I got
>>> the same exception even after I upgraded Storm to 1.0.2.
>>>
>>> I checked other related issues. Let's look at the history of this problem.
>>> DashengJu first reported this problem with non-ACK mode in STORM-738
>>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach to
>>> this problem with ACK mode, and it seemed that the bug had been fixed in
>>> 0.10.0. I don't know whether that patch is included in the storm-1.x branch.
>>> In a word, this problem still exists in the latest stable version.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Zhechao Ma
>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Zhechao Ma
>>>
>>
>
>
> --
> Thanks
> Zhechao Ma
>



-- 
Thanks
Zhechao Ma
