cassandra-commits mailing list archives

From Jonas Borgström (JIRA) <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak
Date Fri, 01 Mar 2019 15:24:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781779#comment-16781779 ]

Jonas Borgström commented on CASSANDRA-15006:
---------------------------------------------

Thanks [~benedict]! Awesome work, your analysis sounds very reasonable!

I checked the logs, and unfortunately these servers only keep logs for 5 days, so the logs
from the node startups are long since lost.

But I did find a bunch of "INFO Maximum memory usage reached (531628032), cannot allocate
chunk of 1048576" log entries, pretty much one every hour on the hour, which probably
corresponds to the hourly Cassandra snapshots taken on each node.
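
As a side note, my own back-of-the-envelope reading of that message (so take it with a grain
of salt): the limit it prints works out to exactly 507 MiB and the failed allocation is a
single 1 MiB chunk, so to me it looks like the chunk cache simply hitting its configured cap
rather than the slow growth itself. The little class below is just my illustration of the
arithmetic:

// Back-of-the-envelope check of the numbers in that log line (my own reading,
// not taken from Cassandra's source).
public class ChunkCacheLogMath {
    public static void main(String[] args) {
        long limitBytes = 531_628_032L; // "Maximum memory usage reached (531628032)"
        long chunkBytes = 1_048_576L;   // "cannot allocate chunk of 1048576"
        System.out.println("cap   = " + (limitBytes / chunkBytes) + " MiB");     // 507
        System.out.println("chunk = " + (chunkBytes / (1024 * 1024)) + " MiB");  // 1
    }
}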

Do you have any idea what the source of these "objects with arbitrary lifetimes" is, and why
their number (at least in my tests) appears to increase linearly forever? If they are somehow
related to repairs, I would assume they would not keep increasing from one repair to the next?
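
For reference, the linear growth I am describing is what I see when polling the
"java.nio:type=BufferPool,name=direct" MBean about once a minute. A minimal sketch of doing
that over JMX is below; the class name is just something I made up, and it assumes Cassandra's
default JMX port 7199 with authentication disabled:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper (not part of Cassandra): polls the direct BufferPool
// MBean over JMX once a minute so the reported usage can be graphed.
public class DirectBufferPoolWatcher {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Assumes the default JMX port 7199 and no JMX authentication.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName direct = new ObjectName("java.nio:type=BufferPool,name=direct");
            while (true) {
                long count = (Long) mbs.getAttribute(direct, "Count");
                long used = (Long) mbs.getAttribute(direct, "MemoryUsed");
                System.out.printf("%d direct buffers, %.1f MiB used%n",
                        count, used / (1024.0 * 1024.0));
                Thread.sleep(60_000);
            }
        }
    }
}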

Also, regarding your proposed workaround for 3.11.x of lowering the chunk cache and buffer
pool settings: would that "fix" the problem, or simply buy some more time until the process
runs out of memory?

I guess that instead of lowering those two settings, simply raising the configured memory
limit from 3 GiB to 4 or 5 GiB without changing the heap size setting would work equally well?
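
To spell out how I think the numbers add up (purely my own assumption about where the memory
goes, so please correct me if this is off):

// Rough budget for one node (my own assumptions, not authoritative): everything
// that is not the Java heap has to fit in whatever the cgroup limit leaves over.
public class MemoryHeadroom {
    public static void main(String[] args) {
        long cgroupLimitMiB = 3 * 1024;  // current container memory limit
        long heapMiB = 2 * 1024;         // 2 GB heap (from the environment above)
        long chunkCacheMiB = 507;        // cap seen in the log entries above
        long leftOverMiB = cgroupLimitMiB - heapMiB - chunkCacheMiB;
        // The networking buffer pool, metaspace, thread stacks, GC/JIT overhead
        // and the slowly growing direct buffers all have to share this:
        System.out.println("Headroom left: " + leftOverMiB + " MiB");
    }
}

So lowering the caches and raising the cgroup limit both seem to widen that same gap.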

I have no problem raising my (rather low) memory limit if I knew I would end up with a setup
that will not run out of memory no matter how long it keeps running.

Again, thanks for your help!

> Possible java.nio.DirectByteBuffer leak
> ---------------------------------------
>
>                 Key: CASSANDRA-15006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but that did not
> seem to make any difference.
>            Reporter: Jonas Borgström
>            Priority: Major
>         Attachments: CASSANDRA-15006-reference-chains.png,
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana - Cassandra(1).png,
> Screenshot_2019-02-14 Grafana - Cassandra.png, Screenshot_2019-02-15 Grafana - Cassandra.png,
> Screenshot_2019-02-22 Grafana - Cassandra.png, Screenshot_2019-02-25 Grafana - Cassandra.png,
> cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly killed by
> the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days it sure looks like
> the "java.nio:type=BufferPool,name=direct" MBean shows a very linear growth (approx
> 15MiB/24h, see attached screenshot). Is this expected to keep growing linearly after
> 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it would take
> quite a few days until it becomes noticeable. I'm able to see the same type of slow growth
> in other production clusters even though the graph data is more noisy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
