storm-user mailing list archives

From Erik Weathers <eweath...@groupon.com>
Subject Re: How Does Nimbus Decide to Restart Topology?
Date Sun, 01 May 2016 21:52:05 GMT
Maybe disk I/O was high?  Are you capturing GC logs to disk in unique files
(you can substitute the PID and timestamp into the GC log filename)?  I know
you believe it's not responsible, but it's the only thing I've ever found
to be responsible thus far.  (Except for a problem in Storm 0.9.3 with
Netty that has since been fixed -- we worked around that by downgrading to
ZeroMQ.)  You might also try monitoring the heartbeat files written by the
workers, to see whether they are being written less frequently than once per
second.
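
To make that concrete: the local heartbeats should be under
<storm.local.dir>/workers/<worker-id>/heartbeats on each supervisor host (if
I'm remembering the layout right), so watching the file timestamps there will
show whether a worker ever goes quiet for anywhere near the timeout.  And for
the GC logs, adding flags like these to worker.childopts in storm.yaml gives
each worker its own log file (just a sketch assuming HotSpot JDK 7/8 flags;
%p and %t are expanded by the JVM to the PID and startup timestamp, and the
path is only an example):

  worker.childopts: "-Xloggc:/var/log/storm/gc-worker-%p-%t.log
                     -XX:+PrintGCDetails -XX:+PrintGCDateStamps
                     -XX:+PrintGCApplicationStoppedTime"

PrintGCApplicationStoppedTime is worth including because it also records
non-GC safepoint pauses, which can hang a worker just as effectively as a
long collection.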

> all of the worker sessions expired at the same time

I'm assuming you're saying that multiple workers had state :timed-out at
once?  Was that on the same host?  If the state is :disallowed, that is
perfectly normal when the reassignment happens, as I described earlier.

- Erik

On Sunday, May 1, 2016, Kevin Conaway <kevin.a.conaway@gmail.com> wrote:

> Any tips on where to continue investigating or other metrics to capture?
>
> As I mentioned before, the topology was mostly idle: low CPU usage, low GC
> time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of
> the worker sessions expired at the same time.
>
> On Sunday, May 1, 2016, Erik Weathers <eweathers@groupon.com> wrote:
>
>> To be clear, the supervisor wasn't shutting itself down, it was killing
>> the worker process.
>>
>> Also for clarity, the :disallowed state simply means a new
>> task->worker assignment has been computed by the nimbus and so the
>> supervisor has noticed from the ZK state that the currently running worker
>> is no longer supposed to be running.
>>
>> Regarding the :timed-out state it *definitely* means the worker was hung
>> and couldn't write to the local heartbeat file within the timeout (30 secs
>> by default).  As for *why*... That is hard to determine.
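>>
>> For reference, the pair of settings behind that check (with what I believe
>> are the defaults) is:
>>
>>   worker.heartbeat.frequency.secs: 1   # how often the worker writes its local heartbeat
>>   supervisor.worker.timeout.secs: 30   # how stale it can get before the supervisor kills the worker
>>
>> Raising supervisor.worker.timeout.secs only hides whatever made the worker
>> stall for 30+ seconds, so I'd treat that as a last resort.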
>>
>> - Erik
>>
>> On Sunday, May 1, 2016, Kevin Conaway <kevin.a.conaway@gmail.com> wrote:
>>
>>> Thanks Erik.  We're using Storm 0.10 so Pacemaker doesn't come into
>>> play here.
>>>
>>> 3. Worker process died due to exception (this is almost always what we
>>> see)
>>> 4. Worker process hung (eg doing GC).
>>>
>>> I don't think it's either of these, TBH.  There were no abnormal
>>> terminations in the logs (or in the Storm error logs). I have the GC
>>> metrics that were reported to Graphite and the GC time was fairly low
>>> (~50ms).
>>>
>>> My feeling is that it is ZooKeeper related and not worker related,
>>> because it appears that ZK also expired sessions from the supervisor
>>> nodes:
>>>
>>> I saw the following messages in the log of one of the ZK nodes:
>>>
>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>
>>> 2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
>>> cport:-1)::PrepRequestProcessor@476] - Processed session termination
>>> for sessionid: 0x45453e198e80076
>>>
>>> 2016-04-30 01:33:46,009 [myid:4] - INFO
>>> [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
>>> client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>
>>> And some corresponding log messages in one of the supervisor log files
>>> (for the same ZK session ID).  It looks like this occurred right as the
>>> supervisor was shutting down due to Nimbus restarting it:
>>>
>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
>>> additional data from server sessionid 0x45453e198e80076, likely server has
>>> closed socket, closing socket connection and attempting reconnect
>>>
>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and
>>> clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current
>>> supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs
>>> 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98
>>> 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74]
>>> [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>
>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
>>> 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>
>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
>>> 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>
>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
>>> execution of cleanup threads on worker.
>>>
>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>> State change: SUSPENDED
>>>
>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
>>> :disconnected::none: with disconnected Zookeeper.
>>>
>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
>>> connection to server <zk host>:2181. Will not attempt to authenticate using
>>> SASL (unknown error)
>>>
>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket
>>> connection established to <zk host>:2181, initiating session
>>>
>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>> reconnect to ZooKeeper service, session 0x45453e198e80076 has expired,
>>> closing socket connection
>>>
>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>> State change: LOST
>>>
>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event
>>> :expired::none: with disconnected Zookeeper.
>>>
>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
>>> expired event received
>>>
>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweathers@groupon.com>
>>> wrote:
>>>
>>>> (I don't know anything about the Pacemaker service introduced in Storm
>>>> 1.0, so this statement is pre-1.0.)
>>>>
>>>> The executor threads within the worker processes write heartbeats to
>>>> ZooKeeper.  If they aren't successfully heartbeating then it could be many
>>>> things:
>>>>
>>>> 1. ZK too busy? (Seems unlikely)
>>>> 2. Network too busy? (Seems unlikely)
>>>> 3. Worker process died due to exception (this is almost always what we
>>>> see)
>>>> 4. Worker process hung (e.g. doing GC). (This would usually first be
>>>> caught by the supervisor on that host since it checks a local heartbeat
>>>> file that the worker normally writes to every second -- if the heartbeat
>>>> doesn't get refreshed before the timeout then the supervisor kills the
>>>> worker process with State being :timed-out (or :time-out, something like
>>>> that).  This of course depends on the various timeout config values you
>>>> have on the worker and nimbus hosts; the relevant config names are below.)
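>>>>
>>>> For reference, the ZK-side knobs (these should be the 0.10 defaults, but
>>>> double-check your storm.yaml):
>>>>
>>>>   task.heartbeat.frequency.secs: 3   # how often each executor heartbeats into ZooKeeper
>>>>   nimbus.task.timeout.secs: 30       # nimbus marks an executor "not alive" past this
>>>>
>>>> plus the supervisor-side pair (worker.heartbeat.frequency.secs and
>>>> supervisor.worker.timeout.secs) that governs the local heartbeat file
>>>> check described in item 4.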
>>>>
>>>> - Erik
>>>>
>>>>
>>>> On Saturday, April 30, 2016, Kevin Conaway <kevin.a.conaway@gmail.com>
>>>> wrote:
>>>>
>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart
>>>>> our topology.  From researching past threads it seems like this is
>>>>> related to not receiving heartbeats from the supervisors but I'm unsure
>>>>> if this was the case.  Our topology was mostly idle at the time that the
>>>>> restart was triggered.
>>>>>
>>>>> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK nodes,
>>>>> I saw the following messages at the time of the restart:
>>>>>
>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x15453e198d10084, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>
>>>>> In the nimbus log, there was the following log message:
>>>>>
>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8]
>>>>> not alive
>>>>>
>>>>> Shortly thereafter, the supervisors started restarting the workers.
>>>>> The following log message was in the supervisor log:
>>>>>
>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>>>>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>>>>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>>>>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>>>>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>>>>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>
>>>>> Previous threads have suggested that this was due to heavy GC causing
>>>>> the heartbeats not to reach ZooKeeper, but the topology was idle at this
>>>>> time so I don't think GC was the culprit.  The ParNew GC time was about
>>>>> 50ms on each node (as reported to Graphite).
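>>>>>
>>>>> (If I'm reading the ZK logs right, the 20000ms in those expiry messages
>>>>> is just Storm's default storm.zookeeper.session.timeout, so a pause on
>>>>> the client side would have to exceed roughly 20 seconds for the sessions
>>>>> to expire.)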
>>>>>
>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>
>>>>> --
>>>>> Kevin Conaway
>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>> https://github.com/kevinconaway
>>>>>
>>>>
>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>
>
