cassandra-commits mailing list archives

From "Sumanth Pasupuleti (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
Date Fri, 09 Nov 2018 21:27:00 GMT


Sumanth Pasupuleti commented on CASSANDRA-14855:

[~zznate] Yes, this was the first time we saw this, although there was a second incident with similar
characteristics on the same cluster. It has happened on only one cluster (our
most read-heavy 3.0 CQL cluster), which has the following characteristics:
* 3.0.17 C* version
* Relatively high read traffic (~60k rps at peak at coordinator level)
* Client-side wire compression (LZ4) enabled
* Total outbound traffic of ~4Gbps across the cluster

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>                 Key: CASSANDRA-14855
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Priority: Major
>             Fix For: 3.0.17
>         Attachments: blocked_thread_pool.png, cpu.png, eventloop_scheduledtasks.png,
flusher running state.png, heap.png, heap_dump.png, read_latency.png
> We recently had a production issue where about 10 nodes in a 96-node cluster ran out
of heap.
> From heap dump analysis, I believe there is enough evidence to indicate that the `queued` data
member of the Flusher grew too large, resulting in out of memory.
> Below are specifics on what we found from the heap dump (relevant screenshots attached):
> * non-empty "queued" data member of Flusher with a retained heap of 0.5GB, and multiple
such instances.
> * "running" data member of Flusher having "true" value
> * Size of scheduledTasks on the eventloop was 0.
> We suspect something (maybe an exception) left the Flusher's running state stuck at
true while the Flusher was no longer able to schedule itself with the event loop.
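The heap-dump observations (non-empty queued, running set to true, no scheduled task) are consistent with a flusher whose run ended abnormally before it could clear its flag or reschedule itself. Below is a minimal, single-threaded sketch of that suspected failure mode. It is not Cassandra's Flusher: the class ToyFlusher, the failNextRun hook, and the plain ArrayDeque standing in for the event loop's task queue are all hypothetical and exist only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of a flag-guarded flusher. Not Cassandra code.
class ToyFlusher implements Runnable {
    final Queue<String> queued = new ConcurrentLinkedQueue<>();
    final AtomicBoolean running = new AtomicBoolean(false);
    // Stand-in for the event loop's pending tasks (single-threaded for determinism).
    final Queue<Runnable> scheduledTasks = new ArrayDeque<>();
    // Illustration-only hook: makes the next run() fail before it clears the flag.
    boolean failNextRun = false;

    void enqueue(String item) {
        queued.add(item);
        // Schedule the flusher only if it is not already marked as running.
        if (running.compareAndSet(false, true)) {
            scheduledTasks.add(this);
        }
    }

    @Override
    public void run() {
        if (failNextRun) {
            throw new IllegalStateException("simulated failure before flag reset");
        }
        while (queued.poll() != null) {
            // A real flusher would write and flush the item here.
        }
        running.set(false);
    }

    // Runs one pending task and swallows its exception, so the loop itself survives.
    void runOneTask() {
        Runnable task = scheduledTasks.poll();
        if (task == null) {
            return;
        }
        try {
            task.run();
        } catch (RuntimeException e) {
            // Assumed behaviour for this sketch: the failure is not propagated.
        }
    }

    public static void main(String[] args) {
        ToyFlusher f = new ToyFlusher();
        f.enqueue("a");
        f.runOneTask();          // normal path: queue drained, flag cleared

        f.failNextRun = true;
        f.enqueue("b");
        f.runOneTask();          // run() throws: flag stays true, nothing rescheduled
        f.enqueue("c");
        f.enqueue("d");          // these only accumulate

        System.out.println("queued=" + f.queued.size()
                + " running=" + f.running.get()
                + " scheduledTasks=" + f.scheduledTasks.size());
        // prints: queued=3 running=true scheduledTasks=0
    }
}
```

In this toy model, once a run fails with the flag still set, every later enqueue sees running as true and skips scheduling, so the queue grows without bound. Whether this is what actually happened in the incident is not established here.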
> We could not find any ERROR in system.log, only the following INFO log around the
incident time.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 - Unexpected
exception during request; channel = [id: 0x8d288811, L:/ - R:/xxx.xx.x.xx:18886]
>$NativeIoException: readAddress() failed: Connection timed
>  at ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at
>  at$EpollStreamUnsafe.epollInReady(
>  at [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.SingleThreadEventExecutor$
>  at io.netty.util.concurrent.DefaultThreadFactory$
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher ([CASSANDRA-13651|]) to 3.0.x
and maybe to other versions as well, since ImmediateFlusher seems to be more robust than the
existing Flusher as it does not depend on any running state/scheduling.
> # Make the "queued" data member of the Flusher bounded, so that it can no longer cause
out of memory by growing without limit.
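The second proposal can be sketched with a capacity-limited queue from the JDK. This is only one possible shape of the idea, not the patch itself: the class BoundedQueuedSketch, its capacity parameter, and the rejected counter are hypothetical, and what the server should do with a rejected item (drop it, close the connection, stop reading from the channel) is left open because the ticket does not say.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a bounded "queued" member. Not Cassandra code.
class BoundedQueuedSketch {
    final LinkedBlockingQueue<String> queued;
    final AtomicLong rejected = new AtomicLong();

    BoundedQueuedSketch(int capacity) {
        // The capacity is an illustrative parameter; a real limit would need tuning.
        this.queued = new LinkedBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue: returns false instead of growing past the capacity.
    boolean enqueue(String item) {
        boolean accepted = queued.offer(item);
        if (!accepted) {
            rejected.incrementAndGet();
        }
        return accepted;
    }

    public static void main(String[] args) {
        BoundedQueuedSketch s = new BoundedQueuedSketch(2);
        System.out.println(s.enqueue("a")); // true
        System.out.println(s.enqueue("b")); // true
        System.out.println(s.enqueue("c")); // false, queue is at capacity
        System.out.println("rejected=" + s.rejected.get()); // rejected=1
    }
}
```

The sketch uses offer rather than put because put blocks when the queue is full, and blocking an event loop thread would stall every channel assigned to it.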

This message was sent by Atlassian JIRA
