cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6364) There should be different disk_failure_policies for data and commit volumes or commit volume failure should always cause node exit
Date Mon, 03 Feb 2014 09:04:10 GMT


Marcus Eriksson commented on CASSANDRA-6364:

About the ignore case, let's hard-code something for now - rate limit at one log error message
per second perhaps?
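
A rough sketch of what I mean by the rate limit, for illustration only. The class name
RateLimitedLog and the shouldLog method are made up here and are not existing Cassandra
code; the one-second interval is simply the value suggested above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not part of Cassandra): lets at most one message
// through per configured interval, regardless of how many threads ask.
final class RateLimitedLog
{
    private final long intervalNanos;
    // Timestamp of the last permitted message; Long.MIN_VALUE means "none yet".
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);

    RateLimitedLog(long interval, TimeUnit unit)
    {
        this.intervalNanos = unit.toNanos(interval);
    }

    // Returns true if the caller may log now. The timestamp is a parameter so
    // the behaviour can be checked without sleeping.
    boolean shouldLog(long nowNanos)
    {
        while (true)
        {
            long last = lastLogged.get();
            if (last != Long.MIN_VALUE && nowNanos - last < intervalNanos)
                return false; // still inside the quiet period
            if (lastLogged.compareAndSet(last, nowNanos))
                return true;  // this caller won the slot
            // another thread updated the timestamp first; re-check
        }
    }

    // Convenience overload using the JVM's monotonic clock.
    boolean shouldLog()
    {
        return shouldLog(System.nanoTime());
    }
}
```

The commit log error path would then wrap its logging call in something like
"if (limiter.shouldLog()) logger.error(...)", with limiter created once as
new RateLimitedLog(1, TimeUnit.SECONDS).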

I don't think we should default to 'ignore' in - if someone does a minor upgrade
they most likely won't check NEWS or update their config files to add the new parameter.

The shipped config in cassandra.yaml looks wrong; I guess it should be commit_failure_policy,
not disk_failure_policy.

> There should be different disk_failure_policies for data and commit volumes or commit
volume failure should always cause node exit
> ----------------------------------------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-6364
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: JBOD, single dedicated commit disk
>            Reporter: J. Ryan Earl
>            Assignee: Benedict
>             Fix For: 2.0.5
> We're doing fault testing on a pre-production Cassandra cluster.  One of the tests was
to simulate failure of the commit volume/disk, which in our case is on a dedicated disk.
 We expected failure of the commit volume to be handled somehow, but what we found was that
no action was taken by Cassandra when the commit volume failed.  We simulated this simply by
pulling the physical disk that backed the commit volume, which resulted in filesystem I/O
errors on the mount point.
> What then happened was that the Cassandra Heap filled up to the point that it was spending
90% of its time doing garbage collection.  No errors were logged in regards to the failed
commit volume.  Gossip on other nodes in the cluster eventually flagged the node as down.
 Gossip on the local node showed itself as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node became
very slow due to the on-going GC, as I assume uncommitted writes piled up on the JVM heap.
 What we believe should have happened is that Cassandra should have caught the I/O error and
exited with a useful log message, or otherwise done some sort of useful cleanup.  Otherwise
the node goes into a sort of Zombie state, spending most of its time in GC, and thus slowing
down any transactions that happen to use the coordinator on said node.
> A limit on in-memory, unflushed writes before refusing requests may also work.  Point
being, something should be done to handle the commit volume dying as doing nothing results
in affecting the entire cluster.  I should note, we are using: disk_failure_policy: best_effort

This message was sent by Atlassian JIRA
