cassandra-commits mailing list archives

From "Carl Yeksigian (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4554) Log when a node is down longer than the hint window and we stop saving hints
Date Tue, 01 Jan 2013 19:24:12 GMT


Carl Yeksigian commented on CASSANDRA-4554:

I've started working on this issue. Recording that a node needs repair is easy, but tracking
the repair itself is difficult, since only the nodes participating in the repair know its state.

I'll outline the case that has me stumped. For simplicity, I assume that Node 1 overlaps only
with Node 2.
- Node 1 goes down and stays down longer than the hint window
- Node 3 stops saving hints for Node 1 and marks Node 1 as needing repair
- Node 1 comes back online
- Node 4 starts a repair between Node 2 and Node 1 by forwarding the streaming repair task
- Node 1 is now up to date and no longer needs repair; Node 2 and Node 4 know this from tracking
the repair task
- Node 3 never learns this and continues to see Node 1 as needing repair; its state
can only be updated if Node 3 itself initiates a repair
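
The scenario above can be sketched as a small simulation. This is a minimal sketch, not Cassandra code: the `Node` class, its `needs_repair` set, and the `run_repair` helper are hypothetical names introduced only for illustration. It assumes each node keeps a purely local view of which peers need repair, and that only the nodes tracking a repair task learn its outcome.

```python
class Node:
    """Hypothetical node that keeps a purely local view of peers needing repair."""

    def __init__(self, name):
        self.name = name
        self.needs_repair = set()  # names of peers this node believes need repair

    def stop_hinting(self, peer):
        # Hint window exceeded: stop saving hints and remember the peer needs repair.
        self.needs_repair.add(peer.name)


def run_repair(initiator, participants):
    """Only the initiator and the participants learn that the repair finished."""
    repaired = {p.name for p in participants}
    for node in [initiator, *participants]:
        node.needs_repair -= repaired


node1, node2, node3, node4 = (Node(f"node{i}") for i in range(1, 5))

node3.stop_hinting(node1)           # Node 1 was down longer than the hint window
run_repair(node4, [node1, node2])   # Node 4 starts a repair between Node 2 and Node 1

# Node 3 took no part in the repair, so its local flag for Node 1 is never cleared.
stale = "node1" in node3.needs_repair
print(stale)
```

Under these assumptions Node 3's flag stays set until Node 3 itself initiates a repair, which is the stale state described in the last step.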

Because of this, I think that the hint state would need to be gossiped. Also, because repairs
run per column family (cf), the gossiped object needs to be tracked per cf rather than per node,
so the existing application state isn't granular enough to capture this additional state.
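
One way to picture a gossiped, per-cf value is a map keyed by (node, cf) that is merged on each gossip exchange. The following is a minimal sketch under stated assumptions: `merge_repair_state`, the tuple keys, the `(version, needs_repair)` values, and the cf names are all hypothetical and do not correspond to Cassandra's gossip API. The version is assumed to be any monotonically increasing stamp, such as a timestamp.

```python
def merge_repair_state(local, remote):
    """Merge two views, keeping the entry with the higher version for each key.

    Keys are (node, column_family); values are (version, needs_repair).
    """
    merged = dict(local)
    for key, (version, flag) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, flag)
    return merged


# Node 3's view: it stopped hinting for node1, so both (made-up) cfs are flagged.
node3_view = {
    ("node1", "ks.users"): (1, True),
    ("node1", "ks.events"): (1, True),
}
# Node 4's view after a repair that covered only ks.users on node1.
node4_view = {
    ("node1", "ks.users"): (2, False),
}

node3_view = merge_repair_state(node3_view, node4_view)
```

After the merge, Node 3 sees `ks.users` as repaired while `ks.events` is still flagged, which is the granularity a single per-node flag could not express.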

I think the possibilities are:
# My example is wrong and I'm missing a component
# The new value needs to be gossiped
# The new value can be incorporated into application state somehow
# The repair coordinator tells all nodes about the state of the repair. In this case, down
nodes would not receive these updates
# Each node can only exit the needs repair state by executing the repair itself. Since the
state is only informative, this may make sense, but seems misleading
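
The drawback of the coordinator-notification option can be sketched as follows. This is a minimal illustration only: `broadcast_repair_done`, the dictionary layout, and the extra `node5` are hypothetical and not part of Cassandra.

```python
def broadcast_repair_done(repaired, nodes):
    """Clear the flag on every reachable node; return the names of nodes that missed it."""
    missed = []
    for node in nodes:
        if node["up"]:
            node["needs_repair"].discard(repaired)
        else:
            missed.append(node["name"])  # a down node never receives the update
    return missed


nodes = [
    {"name": "node3", "up": True, "needs_repair": {"node1"}},
    {"name": "node5", "up": False, "needs_repair": {"node1"}},  # hypothetical down node
]
missed = broadcast_repair_done("node1", nodes)
```

Here `node3` clears its flag, while `node5` is returned in `missed` and still believes `node1` needs repair once it comes back up.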
> Log when a node is down longer than the hint window and we stop saving hints
> ----------------------------------------------------------------------------
>                 Key: CASSANDRA-4554
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>            Assignee: Carl Yeksigian
>            Priority: Minor
>             Fix For: 1.2.1
> We know that we need to repair whenever we lose a node or disk permanently (since it
> may have had undelivered hints on it), but without exposing this we don't know when nodes
> stop saving hints for a temporarily dead node, unless we're paying very close attention to
> external monitoring.
