storm-user mailing list archives

From Kevin Conaway <kevin.a.cona...@gmail.com>
Subject Re: How Does Nimbus Decide to Restart Topology?
Date Mon, 02 May 2016 15:21:32 GMT
Unfortunately, we're not capturing disk I/O in our metrics; I can look into
doing that for next time.

We're not capturing GC logs.  We are using the Graphite Storm metric
consumer to push metrics to Graphite, one of which is the GC time from the
default GC MXBean.
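
For reference, the GC time from the default GC MXBean is the cumulative
collection time exposed through java.lang.management.  A minimal sketch of
reading it directly (the class name is illustrative only, not part of the
Graphite metrics consumer):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints cumulative GC count and time per collector, e.g. "ParNew" and
    // "ConcurrentMarkSweep" when running under CMS.
    public class GcTimeProbe {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(),
                        gc.getCollectionTime());
            }
        }
    }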

> I'm assuming you're saying that multiple workers had state :timed-out at
> once?

We have 6 workers.  Only one had state :timed-out; the others had state
:disallowed.  Looking at one of the supervisors as an example, it looks
like it received multiple requests to reschedule the worker, which caused
the worker to be launched multiple times.  Is this normal?

2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current supervisor time:
1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
:storm-id "<topology>", :executors [[124 124] [64 64] [196 196] [40 40] [28
28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148 148] [136
136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}

2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7

2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time:
1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
:storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118]
[22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82
82] [154 154] [94 94] [34 34] [130 130]], :port 6701}

2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841

2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with
assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103]
[163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153
153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113
113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port
6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67

2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with
assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68]
[198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48
48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78
78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701
with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6


2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time:
1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53
53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13
13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]],
:port 6700}

2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67

2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time:
1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178
178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148
148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]],
:port 6701}

2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
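
For context on the reassignments in these logs, the timeouts usually involved
are sketched below.  These are daemon-side settings that normally live in
storm.yaml on the Nimbus and supervisor hosts; the backtype.storm.Config
constants are used here only to name the keys, and the values shown are the
commonly cited defaults rather than a recommendation for this cluster.

    import backtype.storm.Config;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the timeout keys relevant to :disallowed / :timed-out workers.
    public class TimeoutKeys {
        public static Map<String, Object> sketch() {
            Map<String, Object> conf = new HashMap<>();
            // Seconds Nimbus waits without an executor heartbeat (via ZooKeeper)
            // before marking it "not alive" and computing a new assignment.
            conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
            // Seconds a supervisor waits on a worker's local heartbeat file
            // before killing that worker with state :timed-out.
            conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
            return conf;
        }
    }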

On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <eweathers@groupon.com> wrote:

> Maybe disk I/O was high?  Are you capturing GC logs to disk in unique
> files (you can substitute the PID and timestamp into the GC log filename)?
> I know you believe it's not responsible, but it's the only thing I've ever
> found to be responsible thus far.  (Except for a problem in Storm 0.9.3
> with Netty that has since been fixed -- we worked around that by
> downgrading to ZeroMQ.)  You might try monitoring the heartbeat files
> written by the workers to see whether they are being refreshed less
> frequently than once per second.
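
An illustrative way to get the unique per-worker GC log files Erik describes
is to add GC-logging flags to the worker JVM options.  The sketch below uses
topology.worker.childopts and assumes a HotSpot JVM that expands %p (PID) and
%t (start timestamp) in the -Xloggc path, plus a log directory that already
exists on the supervisor hosts; verify both for your JDK and cluster.

    import backtype.storm.Config;

    // Hedged sketch: write each worker's GC log to its own file.
    public class GcLogOpts {
        public static Config sketch() {
            Config conf = new Config();
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps "
                            + "-Xloggc:/var/log/storm/gc-worker-%p-%t.log"); // path is an assumption
            return conf;
        }
    }
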
>
> > all of the worker sessions expired at the same time
>
> I'm assuming you're saying that multiple workers had state :timed-out at
> once?  Was that on the same host?  If the state is :disallowed, that is
> perfectly normal when the reassignment happens, as I described earlier.
>
> - Erik
>
> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.conaway@gmail.com> wrote:
>
>> Any tips on where to continue investigating or other metrics to capture?
>>
>> As I mentioned before, the topology was mostly idle: low CPU usage, low
>> GC time (CMS ParNew), stable heap, no network interface errors.  It's hard
>> to see why all of the worker sessions expired at the same time.
>>
>> On Sunday, May 1, 2016, Erik Weathers <eweathers@groupon.com> wrote:
>>
>>> To be clear, the supervisor wasn't shutting itself down; it was killing
>>> the worker process.
>>>
>>> Also for clarity, the :disallowed state simply means a new
>>> task->worker assignment has been computed by the nimbus and so the
>>> supervisor has noticed from the ZK state that the currently running worker
>>> is no longer supposed to be running.
>>>
>>> Regarding the :timed-out state, it *definitely* means the worker was
>>> hung and couldn't write to the local heartbeat file within the timeout (30
>>> secs by default).  As for *why*... that is hard to determine.
>>>
>>> - Erik
>>>
>>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.conaway@gmail.com> wrote:
>>>
>>>> Thanks Erik.  We're using Storm 0.10, so Pacemaker doesn't come into
>>>> play here.
>>>>
>>>> 3. Worker process died due to exception (this is almost always what we
>>>> see)
>>>> 4. Worker process hung (e.g. doing GC).
>>>>
>>>> I don't think it's either of these, TBH.  There were no abnormal
>>>> terminations in the logs (or in the Storm error logs). I have the GC
>>>> metrics that were reported to Graphite, and the GC time was fairly low
>>>> (~50ms).
>>>>
>>>> My feeling is that it is ZooKeeper-related and not worker-related,
>>>> because it appears that ZK also expired sessions from the supervisor
>>>> nodes:
>>>>
>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>
>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>
>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
>>>> cport:-1)::PrepRequestProcessor@476] - Processed session termination
>>>> for sessionid: 0x45453e198e80076
>>>>
>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO
>>>> [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
>>>> client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>
>>>> And some corresponding log messages in one of the supervisor log files
>>>> (for the same ZK session ID).  It looks like this occurred right as the
>>>> supervisor was shutting down due to Nimbus restarting it:
>>>>
>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
>>>> additional data from server sessionid 0x45453e198e80076, likely server has
>>>> closed socket, closing socket connection and attempting reconnect
>>>>
>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and
>>>> clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current
>>>> supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs
>>>> 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98
>>>> 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74]
>>>> [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>
>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
>>>> 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>
>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
>>>> 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>
>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
>>>> execution of cleanup threads on worker.
>>>>
>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>>> State change: SUSPENDED
>>>>
>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
>>>> :disconnected::none: with disconnected Zookeeper.
>>>>
>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
>>>> connection to server <zk host>:2181. Will not attempt to authenticate using
>>>> SASL (unknown error)
>>>>
>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket
>>>> connection established to <zk host>:2181, initiating session
>>>>
>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>>> reconnect to ZooKeeper service, session 0x45453e198e80076 has expired,
>>>> closing socket connection
>>>>
>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>>> State change: LOST
>>>>
>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event
>>>> :expired::none: with disconnected Zookeeper.
>>>>
>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
>>>> expired event received
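
A note on the 20000ms in the ZooKeeper "Expiring session" lines: it matches
Storm's default storm.zookeeper.session.timeout.  Raising it (cluster-wide,
normally in storm.yaml, and subject to the ZK server's min/max session
timeout bounds) only buys slack rather than fixing whatever stalled the
clients, but it can be a useful knob while investigating.  A hedged sketch;
the 40-second value is illustrative:

    import backtype.storm.Config;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: the key behind the 20000ms sessions seen in the ZK logs.
    public class ZkSessionTimeoutKey {
        public static Map<String, Object> sketch() {
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 40000); // milliseconds
            return conf;
        }
    }
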
>>>>
>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweathers@groupon.com>
>>>> wrote:
>>>>
>>>>> (I don't know anything about the Pacemaker service introduced in Storm
>>>>> 1.0, so this statement is pre-1.0.)
>>>>>
>>>>> The executor threads within the worker processes write heartbeats to
>>>>> ZooKeeper.  If they aren't successfully heartbeating then it could be many
>>>>> things:
>>>>>
>>>>> 1. ZK too busy? (Seems unlikely)
>>>>> 2. Network too busy? (Seems unlikely)
>>>>> 3. Worker process died due to exception (this is almost always what we
>>>>> see)
>>>>> 4. Worker process hung (e.g. doing GC). (This would usually first be
>>>>> caught by the supervisor on that host since it checks a local heartbeat file
>>>>> that the worker normally writes to every second -- if the heartbeat doesn't
>>>>> get refreshed before the timeout then the supervisor kills the worker
>>>>> process with State being :timed-out (or :time-out, something like that).
>>>>> This of course depends on the various timeout config values you have on the
>>>>> worker and nimbus hosts.)
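
Erik's suggestion of watching the workers' local heartbeat files can be
scripted roughly like this.  The directory layout
(<storm.local.dir>/workers/<worker-id>/heartbeats) is an assumption about
where the supervisor keeps the heartbeat state; confirm the path on your
supervisors before relying on it.

    import java.io.File;

    // Hedged sketch: report how stale each worker's local heartbeat files are.
    // Ages that grow well past ~1 second suggest the worker is stalled.
    public class HeartbeatAgeMonitor {
        public static void main(String[] args) throws InterruptedException {
            File workersDir = new File("/opt/storm-local/workers"); // assumed storm.local.dir
            while (true) {
                long now = System.currentTimeMillis();
                File[] workers = workersDir.listFiles(File::isDirectory);
                if (workers != null) {
                    for (File worker : workers) {
                        long newest = 0L;
                        File[] beats = new File(worker, "heartbeats").listFiles();
                        if (beats != null) {
                            for (File beat : beats) {
                                newest = Math.max(newest, beat.lastModified());
                            }
                        }
                        System.out.printf("worker %s: heartbeat age %d ms%n",
                                worker.getName(),
                                newest == 0 ? -1 : now - newest);
                    }
                }
                Thread.sleep(1000);
            }
        }
    }
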
>>>>>
>>>>> - Erik
>>>>>
>>>>>
>>>>> On Saturday, April 30, 2016, Kevin Conaway <kevin.a.conaway@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart
>>>>>> our topology.  From researching past threads it seems like this is related
>>>>>> to not receiving heartbeats from the supervisors, but I'm unsure if this was
>>>>>> the case.  Our topology was mostly idle at the time that the restart was
>>>>>> triggered.
>>>>>>
>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK nodes,
>>>>>> I saw the following messages at the time of the restart:
>>>>>>
>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>
>>>>>> In the nimbus log, there was the following log message:
>>>>>>
>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8]
>>>>>> not alive
>>>>>>
>>>>>> Shortly thereafter, the supervisors started restarting the workers.
>>>>>> The following log message was in the supervisor log:
>>>>>>
>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>>>>>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>>>>>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>>>>>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>>>>>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>>>>>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>
>>>>>> Previous threads have suggested that this was due to heavy GC causing
>>>>>> the heartbeats not to reach ZooKeeper, but the topology was idle at this
>>>>>> time so I don't think GC was the culprit.  The GC ParNew time was about
>>>>>> 50ms on each node (as reported to Graphite).
>>>>>>
>>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>>
>>>>>> --
>>>>>> Kevin Conaway
>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>> https://github.com/kevinconaway
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
