storm-user mailing list archives

From Jungtaek Lim <>
Subject Re: ShellBolt raise subprocess heartbeat timeout Exception
Date Fri, 21 Oct 2016 03:11:05 GMT
There are many situations in which ShellBolt can trigger a heartbeat issue, and
STORM-1946 at least does not cover this case.

How long does your tuple take to process? You need to set the subprocess
timeout seconds ("topology.subprocess.timeout.secs") higher than the maximum
per-tuple processing time. You can even set it to a fairly large value so that
the subprocess heartbeat issue never occurs.
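For reference, the setting can go into the topology configuration or storm.yaml; the value below is illustrative only, so pick something comfortably above your worst-case tuple processing time:

```yaml
# Illustrative value only; choose it above the worst-case per-tuple time.
topology.subprocess.timeout.secs: 600
```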

ShellBolt requires that each tuple is handled and acked within the heartbeat
timeout. I tried to change this behavior so that the subprocess periodically
sends heartbeats on its own, but had no luck because of the GIL - the global
interpreter lock (the same applies to Ruby). We need to choose one: keep this
restriction, or disable the subprocess heartbeat.
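To make the constraint concrete, here is a toy single-threaded model of the multilang subprocess loop (illustrative only, not storm.py itself): a heartbeat that arrives while process() is running can only be answered after process() returns, so any tuple that takes longer than the timeout trips it.

```python
import time

def run_shell_bolt(commands, process, timeout_secs):
    """Toy model of the single-threaded multilang loop: one thread reads
    commands and either answers a heartbeat or processes a tuple.
    Returns True if no heartbeat gap ever exceeded timeout_secs."""
    last_heartbeat_answered = time.monotonic()
    for cmd in commands:
        if cmd == "heartbeat":
            last_heartbeat_answered = time.monotonic()
        else:
            process(cmd)  # while this runs, no heartbeat can be answered
        # the parent ShellBolt would kill the subprocess if the gap is too long
        if time.monotonic() - last_heartbeat_answered > timeout_secs:
            return False
    return True

# A tuple that takes longer than the timeout trips the heartbeat check:
slow = run_shell_bolt(["heartbeat", "tuple"], lambda t: time.sleep(0.3), 0.2)
fast = run_shell_bolt(["heartbeat", "tuple"], lambda t: time.sleep(0.05), 0.2)
```

Here `slow` comes out False and `fast` True, which is exactly why the timeout must exceed the worst-case per-tuple time.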

I hope we can resolve this issue cleanly, but I suspect the multi-threaded
approach doesn't work for Python, Ruby, or any other language that uses a GIL,
and I have no ideas on alternatives.

- Jungtaek Lim (HeartSaVioR).

On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <> wrote:

> I made an issue (STORM-2150
> <>) 3 days ago; can anyone
> help?
> I've got a simple topology running with Storm 1.0.1. The topology consists
> of a KafkaSpout and several Python multilang ShellBolts. I frequently got
> the following exception.
> java.lang.RuntimeException: subprocess heartbeat timeout
>   at org.apache.storm.task.ShellBolt$
>   at java.util.concurrent.Executors$
>   at java.util.concurrent.FutureTask.runAndReset(
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
>   at java.util.concurrent.ScheduledThreadPoolExecutor$
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(
>   at java.util.concurrent.ThreadPoolExecutor$
>   at
> More information here:
> 1. The topology ran with ACK mode.
> 2. The topology had 40 workers.
> 3. The topology emitted about 10 million tuples every 10 minutes.
> Every time the subprocess heartbeat timed out, workers would restart and
> Python processes exited with exitCode:-1, which affected the processing
> capacity and stability of the topology.
> I've checked some related issues in the Storm Jira. I first found STORM-1946
> <>, which reported a bug related
> to this problem and said the bug had been fixed in Storm 1.0.2. However, I
> got the same exception even after I upgraded Storm to 1.0.2.
> I checked other related issues. Let's look at the history of this problem.
> DashengJu first reported this problem with non-ACK mode in STORM-738
> <>. STORM-742
> <> discussed the approach to
> this problem with ACK mode, and it seemed that the bug had been fixed in
> 0.10.0. I don't know whether that patch is included in the storm-1.x branch.
> In a word, this problem still exists in the latest stable version.
