storm-user mailing list archives

From Ravi Tandon <Ravi.Tan...@microsoft.com>
Subject RE: Cascading "not alive" in topology with Storm 0.9.5
Date Tue, 15 Dec 2015 02:35:43 GMT
Try the following:


·        Increase "nimbus.monitor.freq.secs" (e.g. to 120); this will make Nimbus wait longer before declaring a worker dead. Also check other configs like "supervisor.worker.timeout.secs" that let the system wait longer before re-assigning/re-launching workers (see the example snippet after this list).

·        Check the write load on the ZooKeeper nodes too; the bottleneck may be the quorum and the coordination that goes through it rather than the worker nodes themselves. You can add ZK nodes or provide better-spec machines for the quorum.
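
A minimal storm.yaml sketch of those knobs (the values are illustrative starting points, not recommendations; the nimbus.* settings apply on the Nimbus node, supervisor.* on the supervisor nodes, and the daemons need a restart to pick them up):

    # Check less frequently for timed-out workers before reassigning executors
    nimbus.monitor.freq.secs: 120
    # Tolerate missing executor heartbeats for longer before declaring them dead
    nimbus.task.timeout.secs: 60
    # Tolerate missing worker heartbeats for longer before the supervisor restarts the worker
    supervisor.worker.timeout.secs: 60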

-Ravi

From: Yury Ruchin [mailto:yuri.ruchin@gmail.com]
Sent: Sunday, December 13, 2015 4:22 AM
To: user@storm.apache.org
Subject: Cascading "not alive" in topology with Storm 0.9.5

Hello,

I'm running a large topology using Storm 0.9.5. I have 2.5K executors distributed over 60
workers, 4-5 workers per node. The topology consumes data from a Kafka spout.

I regularly observe Nimbus considering topology workers dead by heartbeat timeout. It then
moves executors to other workers, but soon another worker times out, Nimbus moves its executors,
and so on. The sequence repeats over and over - in effect, there are cascading worker timeouts
in the topology that it cannot recover from. The topology itself looks alive but stops consuming
from Kafka and, as a result, stops processing altogether.

I didn't see any obvious issues with the network, so initially I assumed there might be worker
process failures caused by exceptions/errors inside the process, e.g. an OOME. Nothing appeared
in the worker logs. I then found that the processes were actually alive when Nimbus declared them
dead - it seems they simply stopped sending heartbeats for some reason.

I looked for Java fatal error logs on the assumption that the error might be caused by something
nasty happening at a low level - but found nothing.

I suspected high CPU usage, but it turned out that user CPU + system CPU on the nodes never
went above 50-60% at peak. The regular load was even lower.

I was observing the same issue with Storm 0.9.3, then upgraded to Storm 0.9.5 hoping that
the fixes for https://issues.apache.org/jira/browse/STORM-329
and https://issues.apache.org/jira/browse/STORM-404
would help. But they haven't.

Strangely enough, I can only reproduce the issue in this large setup. Small test setups with
2 workers do not expose it - even after killing all worker processes with kill -9, they
recover seamlessly.

My other guess is that the large number of workers causes significant overhead in establishing
Netty connections during worker startup, which somehow prevents heartbeats from being sent.
Maybe this is something similar to https://issues.apache.org/jira/browse/STORM-763
and it's worth upgrading to 0.9.6 - I don't know how to verify that.
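
For context, the Netty client reconnect behaviour is governed by storm.yaml settings like the ones below (a sketch only; the values are illustrative, not tuned recommendations):

    # Netty transport for inter-worker messaging in Storm 0.9.x
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    # How persistently a worker retries a connection to a peer worker, and the backoff bounds
    storm.messaging.netty.max_retries: 30
    storm.messaging.netty.min_wait_ms: 100
    storm.messaging.netty.max_wait_ms: 1000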

Any help is appreciated.

