cassandra-commits mailing list archives

From "Alex Petrov (Jira)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
Date Sun, 10 Nov 2019 11:27:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971100#comment-16971100 ]

Alex Petrov commented on CASSANDRA-15400:
-----------------------------------------

I've noticed that the patch uses {{validateIfFixedSize}}. I intended to fix it in some other
patch, but wanted to let you know that {{validateIfFixedSize}} is not implemented for {{ByteType}}
and {{ShortType}} even though they're fixed size.
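
To illustrate the kind of gap described above, here is a minimal sketch, written in Python rather than Cassandra's Java. The class bodies, the {{fixed_size}} hook, and the {{accepts}} helper are hypothetical, and the behaviour shown (a type that declares no fixed size is silently skipped by the check) is an assumption about how such a gap would show up, not a reading of the actual patch.

```python
from typing import Optional


class AbstractType:
    """Hypothetical stand-in for a column type; not Cassandra's real class."""

    def fixed_size(self) -> Optional[int]:
        # Default: the type does not declare a fixed serialized size.
        return None

    def validate_if_fixed_size(self, value: bytes) -> None:
        expected = self.fixed_size()
        if expected is None:
            return  # nothing declared, so the length check is skipped
        if len(value) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(value)}")


class Int32Type(AbstractType):
    def fixed_size(self) -> Optional[int]:
        return 4  # declared, so the length is checked


class ByteType(AbstractType):
    pass  # logically 1 byte, but no size is declared in this sketch


class ShortType(AbstractType):
    pass  # logically 2 bytes, but no size is declared in this sketch


def accepts(column_type: AbstractType, value: bytes) -> bool:
    """Return True if the value passes the fixed-size check."""
    try:
        column_type.validate_if_fixed_size(value)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print(accepts(Int32Type(), b"\x00" * 3))  # wrong length for a 4-byte type
    print(accepts(ByteType(), b"\x00" * 3))   # wrong length, but never checked
```

In this sketch the fix would be a one-line override in each of the two classes, returning 1 and 2 respectively.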

> Cassandra 3.0.18 went OOM several hours after joining a cluster
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-15400
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Thomas Steinmaurer
>            Assignee: Blake Eggleston
>            Priority: Normal
>             Fix For: 3.0.20, 3.11.6, 4.0
>
>         Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, cassandra_hprof_dominator_classes.png,
>                      cassandra_hprof_statsmetadata.png, cassandra_jvm_metrics.png, cassandra_operationcount.png,
>                      cassandra_sstables_pending_compactions.png
>
>
> We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have twice hit an
> OOM with 3.0.18 on newly added nodes, several hours after they had successfully
> bootstrapped into an existing cluster.
> Running in AWS:
> * m5.2xlarge, EBS SSD (gp2)
> * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222
> * 4 compaction threads, throttling set to 32 MB/s
> What we see is a steady increase in the OLD gen over many hours.
> !cassandra_jvm_metrics.png!
> * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00
> * It finished joining the cluster (UJ => UN) ~ 19 hrs later, on Oct 31 ~ 07:00,
>   and then began serving client read requests
> !cassandra_operationcount.png!
> Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage
> increased steadily.
> We see a correlation with the increased number of SSTables and pending compactions.
> !cassandra_sstables_pending_compactions.png!
> This continued until we hit the OOM sometime on Nov 1, during the night. After Cassandra
> was started up again (the metric gap in the chart above), the number of SSTables and pending
> compactions is still high, but we have had no memory trouble since then.
> This correlation is confirmed by the auto-generated heap dump, which shows e.g. ~ 5K
> BigTableReader instances with ~ 8.7 GByte retained heap in total.
> !cassandra_hprof_dominator_classes.png!
> Taking a closer look at a single object instance, each instance appears to be ~ 2 MByte
> in size.
> !cassandra_hprof_bigtablereader_statsmetadata.png!
> Each instance holds 2 pre-allocated byte buffers (highlighted in the screenshot above) of 1 MByte each.
> We have been running with 2.1.18 for > 3 years and I can't remember dealing with such
> an OOM in the context of extending a cluster.
> While the MAT screenshots above are from our production cluster, we can partly reproduce
> this behavior in our loadtest environment (although not going full OOM there), so I might
> be able to share an hprof from this non-prod environment if needed.
> Thanks a lot.
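
The heap figures quoted in the description can be cross-checked with some quick arithmetic. The sketch below is in Python and rests on a few assumptions: the "~" values are taken as exact, GB is treated as 1000 MB, and the old generation is taken to be the heap size minus the young generation size set by Xmn. The function and constant names are made up for this example.

```python
# Figures quoted in the report; approximate values are treated as exact here.
HEAP_GB = 12.0           # Xms/Xmx12G
YOUNG_GEN_GB = 3.0       # Xmn3G
READER_COUNT = 5_000     # "~ 5K BigTableReader instances"
RETAINED_GB = 8.7        # "~ 8.7GByte retained heap in total"
BUFFERS_PER_READER = 2   # "2 pre-allocated byte buffers"
BUFFER_MB = 1.0          # "at 1 MByte each"


def old_gen_gb(heap_gb: float, young_gb: float) -> float:
    """Old gen capacity, assuming it is whatever the young gen leaves over."""
    return heap_gb - young_gb


def retained_per_reader_mb(retained_gb: float, count: int) -> float:
    """Average retained heap per reader instance, in MB (1 GB = 1000 MB)."""
    return retained_gb * 1000.0 / count


def buffer_mb_per_reader(buffers: int, buffer_mb: float) -> float:
    """Memory held by the pre-allocated buffers of one reader, in MB."""
    return buffers * buffer_mb


def old_gen_fill(retained_gb: float, heap_gb: float, young_gb: float) -> float:
    """Fraction of the old gen that the readers' retained heap would occupy."""
    return retained_gb / old_gen_gb(heap_gb, young_gb)


if __name__ == "__main__":
    print("old gen (GB):", old_gen_gb(HEAP_GB, YOUNG_GEN_GB))
    print("avg per reader (MB):", retained_per_reader_mb(RETAINED_GB, READER_COUNT))
    print("buffers per reader (MB):", buffer_mb_per_reader(BUFFERS_PER_READER, BUFFER_MB))
    print("old gen fill:", old_gen_fill(RETAINED_GB, HEAP_GB, YOUNG_GEN_GB))
```

Under these assumptions the average retained size per reader comes out slightly below the ~ 2 MByte seen on a single instance, and the readers alone would account for nearly all of the old generation, which fits the steady old gen growth described in the report.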



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org

