storm-user mailing list archives

From Roshan Naik <roshan_n...@yahoo.com>
Subject Re: Flooded topology after a full GC
Date Thu, 19 Dec 2019 23:53:15 GMT
Some thoughts:
If you have ACKing enabled, you can control the number of in-flight msgs using topology.max.spout.pending.
It constrains the spouts from emitting more msgs until outstanding ones are acked.
A long GC pause could cause a timeout, requiring the spout to re-emit those msgs.
However, the timed-out msgs that are already in flight will continue to drain out (assuming
they were not lost due to a worker crash)... so there will be duplicate delivery. Individual
msgs in the "tuple tree" are not tracked for a timeout at every hop (i.e., at each bolt).
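
For reference, a minimal sketch of how the pending cap and the msg timeout can be set programmatically (assuming the Storm 1.2.x Java Config API; the values here are just illustrative):

    import org.apache.storm.Config;

    Config conf = new Config();
    // Cap the number of un-acked tuples each spout task may have in flight.
    conf.setMaxSpoutPending(500);
    // Allow tuples longer than the worst observed STW pause before they time out.
    conf.setMessageTimeoutSecs(120);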

The sudden burst of msgs after an STW GC might be due to timeouts causing the spouts to re-emit.
Check the GC logs to see how long the STW cycles take. Increasing the msg timeout accordingly and
also reducing the number of in-flight msgs could help this situation. Keep in mind that each worker
will have its own STW GC cycles, which means a single tuple tree
can hit multiple STW pauses... depending on how many hops are involved.
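
One way to see the per-worker pause lengths (just a sketch; the flags assume HotSpot 8 workers and the log file name is illustrative) is to enable GC logging in the worker child opts:

    import org.apache.storm.Config;

    Config conf = new Config();
    // HotSpot 8 GC logging for each worker JVM; pause lengths show up in the
    // "Total time for which application threads were stopped" lines.
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-XX:+PrintGCDetails -XX:+PrintGCDateStamps "
        + "-XX:+PrintGCApplicationStoppedTime -Xloggc:gc-worker.log");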

-roshan

On Thursday, December 19, 2019, 06:59:55 AM PST, Ramin Farajollah (BLOOMBERG/ 731 LEX) <rfarajollah@bloomberg.net> wrote:

Correction: HotSpot 8 (not OpenJDK 8)

From: user@storm.apache.org At: 12/19/19 09:56:34
To:  user@storm.apache.org
Subject: Flooded topology after a full GC

> Hi,
> 
> We use an object pool for the messages carried in tuples. It has been effective in reducing
> the GCs caused by creating these heavy objects.
> 
> After a full GC (~30 sec), the Zookeeper connection is suspended and is restored by Curator.
> This is followed by a huge rise in the number of objects (presumably in flight). This
> leads to more frequent full GCs and the eventual crash of the topology.
> 
> I'm trying to understand what triggers the huge rise immediately after the STW full GC /
> Curator reconnect. My guess is that all tuples failed due to the zk timeout and were resent.
> In addition, ack/fail signals may be exacerbating the situation.
> 
> My questions are:
> 1) How to determine if tuples are resent?
> 2) How to determine if acks/fails contribute to the traffic?
> 3) Without back pressure, are excessive tuples silently discarded from the outbound
> or the inbound queues?
> 4) What happens to the failed tuples? (I need a hook to release the objects).
> 
> Details:
> - OpenJDK 8
> - Storm 1.2.3
> - Curator 2.12.0
> - zk session timeout 40000 ms, connection timeout 1500 ms
> - Initially the cache is adequate (8 GB)

