cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
Date Tue, 20 Nov 2018 10:54:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693043#comment-16693043 ]

Benedict commented on CASSANDRA-14855:
--------------------------------------

That seems like a reasonable path forward to me, yes.  I'll review the patch once you have
it prepared.

I'd propose a more descriptive name for the property, though, one that's prefixed by 'native_transport'.

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Major
>             Fix For: 3.0.17
>
>         Attachments: blocked_thread_pool.png, cpu.png, eventloop_scheduledtasks.png, flusher running state.png, heap.png, heap_dump.png, read_latency.png
>
>
> We recently had a production issue where about 10 nodes in a 96 node cluster ran out of heap.
> From heap dump analysis, I believe there is enough evidence to indicate `queued` data member of the Flusher got too big, resulting in out of memory.
> Below are specifics on what we found from the heap dump (relevant screenshots attached):
> * non-empty "queued" data member of Flusher with a retained heap of 0.5GB, and multiple such instances.
> * "running" data member of Flusher having "true" value
> * Size of scheduledTasks on the eventloop was 0.
> We suspect something (maybe an exception) left the Flusher's running state set to true while the Flusher was no longer able to schedule itself with the event loop.
> We could not find any ERROR entries in the system.log, only the following INFO log around the incident time.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>  at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651], https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) to 3.0.x and maybe to other versions as well, since ImmediateFlusher seems to be more robust than the existing Flusher as it does not depend on any running state/scheduling.
> # Make the "queued" data member of the Flusher bounded, so that it can no longer grow without limit and cause an out of memory condition.
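
For illustration only, the second proposal (bounding the queue) could be sketched roughly as below. This is not Cassandra's Flusher and not the patch under discussion: the class name BoundedFlushQueue, the methods enqueue/drain/size, and the capacity parameter are all hypothetical, and the sketch assumes that rejecting an item when the queue is full is an acceptable way to signal backpressure to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of a bounded pending-flush queue; names are illustrative only.
final class BoundedFlushQueue<T> {
    // Fixed-capacity queue: memory held by pending items is capped by 'capacity'.
    private final ArrayBlockingQueue<T> queued;

    BoundedFlushQueue(int capacity) {
        this.queued = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking add. Returns false when the queue is full, so the caller
    // can apply backpressure instead of letting the queue grow.
    boolean enqueue(T item) {
        return queued.offer(item);
    }

    // Removes all currently queued items and returns them in FIFO order.
    List<T> drain() {
        List<T> batch = new ArrayList<>();
        queued.drainTo(batch);
        return batch;
    }

    int size() {
        return queued.size();
    }
}
```

With a bound in place, a flusher that stops draining would cause enqueue to start returning false once the capacity is reached, rather than accumulating items until the heap is exhausted. What the caller should do on rejection (drop, close the connection, stop reading) is a separate design decision that this sketch does not address.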



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org

