cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
Date Fri, 30 Nov 2018 11:44:00 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-14855:
---------------------------------
    Reviewers: Benedict

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.17
>
>         Attachments: blocked_thread_pool.png, cpu.png, eventloop_scheduledtasks.png, flusher running state.png, heap.png, heap_dump.png, read_latency.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We recently had a production issue in which about 10 nodes of a 96-node cluster ran out of heap.
> From heap dump analysis, I believe there is enough evidence that the `queued` data member of the Flusher grew too large, resulting in out of memory.
> Below are specifics on what we found from the heap dump (relevant screenshots attached):
> * A non-empty "queued" data member of the Flusher with a retained heap of 0.5GB, and multiple such instances.
> * The "running" data member of the Flusher had the value "true".
> * The size of scheduledTasks on the event loop was 0.
> We suspect that something (maybe an exception) left the Flusher's running state set to true while the Flusher was unable to schedule itself with the event loop.
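
The suspected failure mode can be illustrated with a small, deterministic model. This is only a sketch under stated assumptions, not the Cassandra Flusher: the class `StuckFlusherSketch`, the `failNextRun` fault switch, and the plain `ArrayDeque` standing in for the event loop's scheduled tasks are all hypothetical names introduced for illustration. It shows how a task that fails before it reschedules itself or clears its running flag can leave the flag at true with nothing scheduled, so later enqueues only grow the queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class StuckFlusherSketch {
    // Hypothetical stand-in for the event loop's scheduled task queue (not Netty).
    static final Queue<Runnable> scheduledTasks = new ArrayDeque<>();

    static final class Flusher implements Runnable {
        final Queue<String> queued = new ConcurrentLinkedQueue<>();
        final AtomicBoolean running = new AtomicBoolean(false);
        boolean failNextRun = false; // injected fault, for the demonstration only

        void enqueue(String item) {
            queued.add(item);
            start();
        }

        // Schedules the flusher only on the false -> true transition of the flag.
        void start() {
            if (running.compareAndSet(false, true))
                scheduledTasks.add(this);
        }

        @Override
        public void run() {
            // A failure here happens before the flag is cleared or the task rescheduled.
            if (failNextRun)
                throw new IllegalStateException("simulated failure");
            boolean didWork = false;
            while (queued.poll() != null)
                didWork = true;
            if (didWork)
                scheduledTasks.add(this); // stay scheduled while there is traffic
            else
                running.set(false);       // idle: allow start() to schedule again
        }
    }

    // Runs one scheduled task; the exception is swallowed here to model a loop
    // that survives a failing task (an assumption of this sketch).
    static void runEventLoopOnce() {
        Runnable task = scheduledTasks.poll();
        if (task == null)
            return;
        try {
            task.run();
        } catch (RuntimeException e) {
            // ignored in this model
        }
    }

    public static void main(String[] args) {
        Flusher flusher = new Flusher();
        flusher.failNextRun = true;
        flusher.enqueue("first");
        runEventLoopOnce();          // the run fails; the flag stays true
        flusher.failNextRun = false;
        for (int i = 0; i < 1000; i++)
            flusher.enqueue("item-" + i);
        System.out.println("running=" + flusher.running.get()
                + " scheduledTasks=" + scheduledTasks.size()
                + " queued=" + flusher.queued.size());
    }
}
```

In this model the end state matches the three heap dump observations in shape: a non-empty queue, a running flag of true, and no scheduled tasks. Whether an exception was the actual trigger in production is not established by the report.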
> We could not find any ERROR entries in system.log, only the following INFO logs around the time of the incident.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>  at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651], https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) to 3.0.x and maybe to other versions as well, since ImmediateFlusher seems to be more robust than the existing Flusher: it does not depend on any running state or scheduling.
> # Bound the "queued" data member of the Flusher so that it can no longer grow without limit and cause an out-of-memory condition.
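
The second proposal, bounding the queue, can be sketched as follows. This is a minimal illustration under stated assumptions, not a patch against Cassandra: `BoundedFlusherSketch`, `tryEnqueue`, `flush`, and the capacity of 2 are hypothetical, and the sketch uses the standard `java.util.concurrent.ArrayBlockingQueue` in place of whatever structure the real Flusher uses. The point is only that a full queue rejects new items instead of retaining them on the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedFlusherSketch {
    private final BlockingQueue<String> queued;

    BoundedFlusherSketch(int capacity) {
        // Fixed capacity: the queue can never hold more than this many items.
        this.queued = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the queue is full instead of growing without bound.
    boolean tryEnqueue(String item) {
        return queued.offer(item);
    }

    // Drains everything currently queued and returns the number of items drained.
    int flush() {
        List<String> batch = new ArrayList<>();
        return queued.drainTo(batch);
    }

    int pending() {
        return queued.size();
    }

    public static void main(String[] args) {
        BoundedFlusherSketch flusher = new BoundedFlusherSketch(2);
        System.out.println(flusher.tryEnqueue("a")); // accepted
        System.out.println(flusher.tryEnqueue("b")); // accepted
        System.out.println(flusher.tryEnqueue("c")); // rejected: queue is full
        System.out.println(flusher.flush());         // drains the two queued items
        System.out.println(flusher.tryEnqueue("c")); // accepted again after the flush
    }
}
```

What the caller does with a rejected item (apply backpressure, fail the request, or close the connection) is a design choice the proposal leaves open, and the sketch does not take a position on it.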



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


