storm-user mailing list archives

From "M. Aaron Bossert" <maboss...@gmail.com>
Subject Cascading topology failure
Date Wed, 16 Aug 2017 15:13:26 GMT
All,

I am running a topology that reads many small PCAP files and merges their
packets into larger files grouped by flow.  The intent is to process the
resulting merged PCAP files with Bro and then store both the PCAP and the Bro
output in HBase.  As far as I can tell, I have NO issues with the logic of
the topology, but I am hitting scaling issues that are most likely related to
increased parallelism.  I have been doing a lot of Googling for
Netty-related failures (of which I see many in my logs) and have found an
older issue that seems to match my symptoms:
https://github.com/verisign/storm-bolt-of-death/blob/master/README_STORM-0.9.3.md.
Unfortunately, all the answers surrounding the Netty issue say it has been
fixed since at least 1.0.x (I am using 1.1.0 in HDP 2.6).

The most prominent symptom is that the topology starts up as expected and
runs at roughly 90K tuples per second, then suddenly trails off and seems
to hang with no apparent errors other than a TON of Netty reconnect
timeouts.  Here is a screenshot from Grafana of the system CPU and memory
usage that illustrates the failure:

[inline Grafana screenshot not preserved in the archive]

Hardware and Parallelism settings:

I have 3 nodes dedicated entirely to Storm, aside from each also running an
HDFS DataNode.  Each node has 56 physical cores (4 x 14-core Xeons), 760GB
of RAM, and spinning local disks.  I have set the following:


   - Number of workers: 168
   - Parallelism hints: 1 + 64 + 16 + 64 + 16 = 161
   - Max spout pending: 168.  I was trying to throttle the spout to make
   sure the first bolt didn't choke, since it takes one tuple as input and
   emits on average 180K tuples.  I have already tried a bunch of different
   settings for this, ranging from the default of 1000 up to 42000 (the
   number of files), and have not seen any significant improvement.
   - Message timeout: 600 seconds.  The maximum I have observed so far is
   about 260 seconds to process a PCAP file, though that seems to happen
   when the system is getting backed up; on average it is closer to 10-20
   seconds per file.
   - Backpressure: enabled, with low and high watermarks of 0.6 and 0.8,
   respectively.
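For reference, here is how I understand those settings mapping onto the
Storm 1.x config keys (shown as a plain dict rather than the Java Config
API; treat the key names as my reading of defaults.yaml, not a verified
config dump):

```python
# Topology settings above, expressed as Storm 1.x config keys
# (key names as I understand them from storm's defaults.yaml).
conf = {
    "topology.workers": 168,
    "topology.max.spout.pending": 168,
    "topology.message.timeout.secs": 600,
    "topology.backpressure.enable": True,
    "backpressure.disruptor.low.watermark": 0.6,
    "backpressure.disruptor.high.watermark": 0.8,
}

# Sanity checks: 168 workers across 3 nodes is 56 per node (one per
# physical core), and the parallelism hints sum to 161 executors.
workers_per_node = conf["topology.workers"] // 3
executors = 1 + 64 + 16 + 64 + 16
print(workers_per_node, executors)  # 56 161
```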


Spout (retrieves the list of files to process; this run is ~42K files) ->
parallelism hint: 1
Bolt1 (reads each PCAP file and emits individual packets, with partial key
grouping on IP pairs) -> parallelism hint: 64
Bolt2 (further splits packets by IP and port pairs using partial key
grouping) -> parallelism hint: 16
Bolt3 (sessionizes traffic and emits entire sessions, with partial key
grouping on IP and port pairs) -> parallelism hint: 64
Bolt4 (writes sessions out to new PCAP files) -> parallelism hint: 16
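For anyone unfamiliar with partial key grouping: it relaxes fields grouping
by hashing each key to two candidate downstream tasks and routing each tuple
to the less-loaded of the two, so a hot key (a heavy flow, here) is spread
across at most two executors instead of pinned to one.  A minimal sketch of
the idea (my own toy version, not Storm's implementation):

```python
import hashlib

def _h(key: str, seed: int, n: int) -> int:
    """Hash a key to one of n tasks; the seed gives us two independent hashes."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % n

def partial_key_choose(key: str, loads: list) -> int:
    """Pick the less-loaded of the two candidate tasks for this key."""
    a = _h(key, 0, len(loads))
    b = _h(key, 1, len(loads))
    target = a if loads[a] <= loads[b] else b
    loads[target] += 1
    return target

# Even a single hot key gets split across at most two of the four tasks:
loads = [0, 0, 0, 0]
targets = {partial_key_choose("10.0.0.1->10.0.0.2", loads) for _ in range(1000)}
assert len(targets) <= 2 and sum(loads) == 1000
```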

My logic for selecting the number of workers and the parallelism hints was
based on the number of available cores as well as the observed
latency/complexity of each bolt.  Right now, the only bolt that seems to be
overloaded is Bolt1; no matter what I do, it stays maxed out.
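A back-of-envelope calculation (using the rough averages above) shows why
Bolt1 dominates -- it alone turns the ~42K input files into billions of
tuples:

```python
# Rough numbers from this run: ~42K files, ~180K packets emitted per file.
files = 42_000
tuples_per_file = 180_000
total = files * tuples_per_file
print(f"{total:,} tuples out of Bolt1")   # 7,560,000,000 tuples out of Bolt1

# At the observed steady-state rate of ~90K tuples/sec, draining that
# stream takes roughly a day of wall-clock time:
hours = total / 90_000 / 3600
print(f"~{hours:.0f} hours")              # ~23 hours
```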

Here is an excerpt from my worker logs:

2017-08-16 04:11:30.967 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 1000ms (MAX)

2017-08-16 04:11:31.067 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 105 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6713
failed: java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6713

2017-08-16 04:11:31.067 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 1000ms (MAX)

2017-08-16 04:11:32.167 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 106 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6713
failed: java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6713

2017-08-16 04:11:32.167 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 1000ms (MAX)

2017-08-16 04:11:33.266 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 107 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6713
failed: java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6713

2017-08-16 04:11:33.267 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 1000ms (MAX)

2017-08-16 04:12:31.859 o.a.s.m.n.Client refresh-connections-timer [INFO]
creating Netty Client, connecting to r7u15.thanos.gotgdt.net:6715,
bufferSize: 5242880

2017-08-16 04:12:31.870 o.a.s.s.o.a.c.r.ExponentialBackoffRetry
refresh-connections-timer [WARN] maxRetries too large (30). Pinning to 29

2017-08-16 04:12:31.874 o.a.s.m.n.Client refresh-connections-timer [INFO]
closing Netty Client Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:31.875 o.a.s.m.n.Client refresh-connections-timer [INFO]
waiting up to 600000 ms to send 0 pending messages to
Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:37.843 o.a.s.m.n.StormServerHandler
Netty-server-localhost-6700-worker-1 [ERROR] server errors in handling the
request
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_121]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_121]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_121]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_121]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_121]
	at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.1.0.2.6.1.0-129.jar:1.1.0.2.6.1.0-129]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

2017-08-16 04:12:42.278 o.a.s.m.n.Client refresh-connections-timer [INFO]
creating Netty Client, connecting to r7u15.thanos.gotgdt.net:6705,
bufferSize: 5242880

2017-08-16 04:12:42.281 o.a.s.s.o.a.c.r.ExponentialBackoffRetry
refresh-connections-timer [WARN] maxRetries too large (30). Pinning to 29

2017-08-16 04:12:42.293 o.a.s.m.n.Client refresh-connections-timer [INFO]
closing Netty Client Netty-Client-r7u16.thanos.gotgdt.net/10.55.50.209:6715

2017-08-16 04:12:42.295 o.a.s.m.n.Client refresh-connections-timer [INFO]
waiting up to 600000 ms to send 0 pending messages to
Netty-Client-r7u16.thanos.gotgdt.net/10.55.50.209:6715

2017-08-16 04:12:42.369 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 1 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:42.370 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 102ms (NOT MAX)

2017-08-16 04:12:42.572 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 2 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:42.573 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 106ms (NOT MAX)

2017-08-16 04:12:42.766 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 3 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:42.767 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 109ms (NOT MAX)

2017-08-16 04:12:42.966 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 4 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:42.966 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 117ms (NOT MAX)

2017-08-16 04:12:43.166 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 5 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:43.167 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 160ms (NOT MAX)

2017-08-16 04:12:43.366 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 6 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705

2017-08-16 04:12:43.366 o.a.s.u.StormBoundedExponentialBackoffRetry
client-boss-1 [WARN] WILL SLEEP FOR 167ms (NOT MAX)

2017-08-16 04:12:43.567 o.a.s.m.n.Client client-boss-1 [ERROR] connection
attempt 7 to Netty-Client-r7u15.thanos.gotgdt.net/10.55.50.208:6705 failed:
java.net.ConnectException: Connection refused:
r7u15.thanos.gotgdt.net/10.55.50.208:6705
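
For what it's worth, the "WILL SLEEP FOR ..." lines above (102ms, 106ms,
109ms, 117ms, 160ms, ... then pinned at 1000ms "(MAX)") look consistent with
a bounded exponential backoff with jitter.  A rough sketch of that retry
pattern (my own illustration, not the exact formula in
StormBoundedExponentialBackoffRetry):

```python
import random

def bounded_backoff_ms(attempt: int, base_ms: int = 100,
                       cap_ms: int = 1000, jitter_ms: int = 20) -> int:
    """Sleep time for a given retry attempt: exponential growth from
    base_ms, plus a little random jitter, capped at cap_ms."""
    raw = base_ms * (2 ** attempt) + random.randint(0, jitter_ms)
    return min(raw, cap_ms)

# Early attempts sleep just over base_ms; later attempts pin at the cap,
# matching the "(NOT MAX)" then "(MAX)" progression in the log.
sleeps = [bounded_backoff_ms(a) for a in range(8)]
assert sleeps[0] < 1000 and sleeps[-1] == 1000
```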
