storm-user mailing list archives

From Stig Rohde Døssing <s...@apache.org>
Subject Re: Storm hung issue
Date Thu, 01 Mar 2018 17:08:35 GMT
I agree that you should consider upgrading.

Your description here ("We tried to restart the same Storm topology, but it
fails within 1-2 minutes after processing around 15K-16K tuples. If we
decrease the max.spout.pending value, it fails after processing only a few
tuples.") also makes me wonder whether there's something about the tuples
themselves that causes your bolts to hang. You might want to try taking
thread dumps of the bolt workers the next time it happens; that should tell
you whether the bolts are stuck in your own code or somewhere in Storm.
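
For example, assuming you can log in to the worker hosts and have the JDK
tools on the path (the grep pattern and file name below are just
illustrative, not specific to your setup), taking a dump could look roughly
like this:

    # find the worker JVM pid; 0.9.x workers run the main class
    # backtype.storm.daemon.worker, so it should show up in jps output
    jps -ml | grep backtype.storm.daemon.worker

    # write all thread stacks for that pid to a file for later inspection
    jstack <pid> > worker-threads.txt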

2018-03-01 17:54 GMT+01:00 Erik Weathers <eweathers@groupon.com>:

> Agreed, there have been a number of fixes in the storm-kafka spout that
> might account for that problem.  If you need to debug further on 0.9.x, you
> should dump the Kafka consumer offsets and see if the topology is getting
> stuck at some specific offsets.  Then examine the data at those offsets
> using a console consumer to try to infer why the topology would get stuck.
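>
> For instance, assuming the topology uses the old storm-kafka spout, which
> commits its offsets to ZooKeeper under <zkRoot>/<spout id>/partition_<N>
> (the paths, hosts and topic below are placeholders, and the exact tool and
> flag names vary between Kafka versions), a rough sketch would be:
>
>     # read the committed-offset JSON for one partition from ZooKeeper
>     zkCli.sh -server zkhost:2181 get /<zkRoot>/<spout-id>/partition_0
>
>     # then fetch a few messages around a stuck offset (e.g. 163348 from the
>     # DEBUG log below) using the simple consumer shell that ships with
>     # 0.8.x/0.9.x Kafka
>     kafka-simple-consumer-shell.sh --broker-list broker:9092 \
>         --topic <topic> --partition 0 --offset 163348 --max-messages 5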
>
> - Erik
>
> On Wed, Feb 28, 2018 at 2:41 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>
>> Hi Ajeesh,
>>
>> Sorry, but that version is really outdated; it was released three years
>> ago. Would you mind upgrading to a recent version, 1.2.1 for example, and
>> seeing how it helps?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Wed, Feb 28, 2018 at 9:48 PM, Ajeesh <ajeeshreloaded@gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>>     We are facing issues with Storm version 0.9.4: the Storm application
>>> hangs after processing for 3-4 days. We tried to restart the same Storm
>>> topology, but it fails within 1-2 minutes after processing around 15K-16K
>>> tuples. If we decrease the max.spout.pending value, it fails after
>>> processing only a few tuples.
>>>
>>>     If we start a new topology with a new Kafka topic, then everything
>>> works fine for 3-4 days. Our daily volume is around 11 million tuples.
>>>
>>>     We checked the execute latency; it's around 6ms.
>>>     We checked the worker logs; there are no errors or exceptions.
>>>     The Storm visualization graph shows all nodes in green.
>>>
>>>     Workflow:
>>>         KafkaSpout->Bolt-1->Bolt-2->Bolt-3->Bolt-4.
>>>
>>>     Storm configurations:
>>>         No. of workers: 10
>>>         No. of executors: 260
>>>         Max Spout Pending: 50
>>>         No. of KafkaSpout executors: 10
>>>
>>>     TODO:
>>>         1. We want to take a thread dump.
>>>         2. Is there anything else you need to know about this issue?
>>>
>>> Analyzed worker logs in debug mode:
>>>     2018-02-28T05:12:51.462-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=163348 with _pending.size()=3953 pending and _emittedToOffset=168353
>>> 2018-02-28T05:12:51.461-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=195116 with _pending.size()=4442 pending and _emittedToOffset=199437
>>> 2018-02-28T05:12:51.463-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=194007 with _pending.size()=4442 pending and _emittedToOffset=199437
>>> 2018-02-28T05:12:51.463-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=194700 with _pending.size()=4442 pending and _emittedToOffset=199437
>>> 2018-02-28T05:12:51.463-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=193891 with _pending.size()=4442 pending and _emittedToOffset=199437
>>> 2018-02-28T05:12:51.463-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=194455 with _pending.size()=4442 pending and _emittedToOffset=199437
>>> 2018-02-28T05:12:51.463-0500 s.k.PartitionManager [DEBUG] failing at
>>> offset=194632 with _pending.size()=4442 pending and _emittedToOffset=199437
>>>
>>> 2018-02-28T05:14:05.241-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d003e after 10ms
>>> 2018-02-28T05:14:05.703-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x361d81df7a80048 after 0ms
>>> 2018-02-28T05:14:05.703-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d003d after 0ms
>>> 2018-02-28T05:14:05.745-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30043 after 2ms
>>> 2018-02-28T05:14:05.775-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30045 after 3ms
>>> 2018-02-28T05:14:05.849-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x361d81df7a80044 after 1ms
>>> 2018-02-28T05:14:05.969-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30046 after 0ms
>>> 2018-02-28T05:14:07.067-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30041 after 11ms
>>> 2018-02-28T05:14:07.131-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d003c after 0ms
>>> 2018-02-28T05:14:07.135-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30042 after 0ms
>>> 2018-02-28T05:14:07.140-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d003b after 0ms
>>> 2018-02-28T05:14:07.150-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30044 after 0ms
>>> 2018-02-28T05:14:08.319-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x361d81df7a8004b after 6ms
>>> 2018-02-28T05:14:08.938-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d0042 after 1ms
>>> 2018-02-28T05:14:08.977-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x161d81df7a30047 after 10ms
>>> 2018-02-28T05:14:08.985-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d0043 after 6ms
>>> 2018-02-28T05:14:08.985-0500 o.a.z.ClientCnxn [DEBUG] Got ping response
>>> for sessionid: 0x261d81e8b3d0044 after 7ms
>>>
>>>
