cassandra-commits mailing list archives

From "Delaney Manders (JIRA)"
Subject [jira] [Resolved] (CASSANDRA-4225) EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
Date Fri, 01 Jun 2012 14:31:23 GMT


Delaney Manders resolved CASSANDRA-4225.

    Resolution: Invalid

My ticket was finally closed by AWS.

Their response:
> The Kernel team has got back to me. They say that there is a new kernel for the AMI which
> has some patches in the net_rx area that shows up in your traces.

I've moved two machines to the new patched AMI, and they've been solid for 3 days now.  I
consider this closed.

> EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
> -----------------------------------------------------------------
>                 Key: CASSANDRA-4225
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.0
>         Environment: Amazon Linux AMI release 2012.03
> 3.2.12-3.2.4.amzn1.x86_64
> m1.xlarge
> Nodes have:
> Cassandra built and installed from source.
> Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), libtool(2.2.10)
installed from AWS repository.
> Sun Java:
> > java -version
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> Only system changes are:
> echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
> echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
> Setup scripts available.
> Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having 4, DC2
being reserved for Hadoop jobs.  DC2 nodes have not had the same frequency of hard crashes,
though it has happened.
> Storage is set up with 4 ephemeral drives raided for commit, 4 EBS drives raided for
> Usage is exclusively write, with all mutations being done in batch mutations, where each
batch mutation has a set of columns added/modified to a single key.  There are ~2000 threads
streaming batch mutations from a web edge of varying size, distributed across DC1.  Client
is Hector(1.0-5) w/ DynamicLoadBalancing.
> In an effort to mitigate this issue, I've removed jna.jar & platform.jar from $CASSANDRA_HOME/lib,
and set disk_access_mode: standard in $CASSANDRA_HOME/conf/cassandra.yaml.  Neither has seemed
to help.
>            Reporter: Delaney Manders
> At fairly random intervals, about once a day, one of my Cassandra nodes hard-crashes
(kernel panic).
> I can find no system logs (/var/log/*) which have any errors.  No cassandra logs have
any errors.  
> On one machine I was watching as it went down, and caught the following message:
> > Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
> >  kernel:[252906.019808] Oops: 0002 [#1] SMP
> An AWS support guy found one entry in the console logs:
> > [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
> I've replaced two of the nodes with new instances, but all are showing the same behaviour.
> It's very reproducible on my system, though it takes a little waiting.  Leaving it running
for another day or so is no big deal; I just need to restart Cassandra every once in a while
when I get alerted.
> I'm open to any additional requested debugging steps before bailing and going back to
