cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jaakko Laine (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-562) Make range changes more fully automatic
Date Wed, 02 Dec 2009 08:04:20 GMT


Jaakko Laine commented on CASSANDRA-562:

Yeah, we can of course do that, at least for now when the operator is human. It's a bit pity
though that we provide commands that can potentially cause harm (although the odds are not
very big). In the future when the operator might be Cassandra herself, this is (I think) bigger
problem and needs to be solved one way or the other.

I continued a bit more on the 2-phase-commit path, and while the idea would seem to work,
there is at least one big problem: what to do if one of the nodes affected by range changes
is down? Since not all nodes are aware of all range changes, quorum permission is not enough
for 100% guarantee and we must get permission from all of them to be sure. However, if even
one of them is down, we're stuck and cannot move before the node either comes back online
or is manually removed. This is obviously not good, so in the end we have to relax the requirements
in any case, which would strongly suggest that my idea of having 2-phase-commit for leaving
was less than ideal in the first place :)

One "easy" way would be to add one more state as discussed in the "3rd option" above, namely
add ABOUT_TO_LEAVE state that would be gossiped before LEAVING. Node wanting to leave first
gossips ABOUT_TO_LEAVE, waits for ring delay, and checks if anything conflicts with its intentions
(namely ABOUT_TO_LEAVE or LEAVING from another node that would effect same ranges as this
node's move). If there are no conflicts, the node can proceed with leaving. If there are conflicts,
the node must return to NORMAL state and wait for random time before trying to leave again.
This would give ring delay time for the leave preparation info to propagate before the actual
move, which should be OK for almost all cases.

Of course this additional state would complicate things slightly, but would seem to be straightforward
enough and solve simultaneous leave problem for most cases. Good thing is this would not modify
token metadata in any way, only need storage service to remember what nodes are preparing
to leave.

I think it would be best to wait on this issue a bit, though. There are some gossiping modifications
that need to be done, so we may as well add this after those modifications if it is decided
such state will be added.

> Make range changes more fully automatic
> ---------------------------------------
>                 Key: CASSANDRA-562
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jaakko Laine
>            Priority: Minor
>             Fix For: 0.5
> I think there are currently some problems with bootstrapping & leaving. The inherent
problem is that a node one-sidedly announces that it is going to leave/take a token without
making sure it will not cause conflict, and we do not have proper mechanism to clean up after
a conflict. There are currently ways a simultaneous bootstrap or leaving can leave the cluster
(tokenmetadata) in an inconsistent state and we'll need either a mechanism to resolve which
operation wins or make sure only one operation is in process for affected ranges at one time.
> 1st option (resolve & clean conflicts as they happen):
> We could add local timestamp to bootstrap/leave gossip and resolve conflicts based on
that. This would allow us to choose one of the operations unambiguously and reject all but
the one that was first. Theoretically, if different data centers are not in perfect clock
sync, this might always favor one DC over the other in race situations, but this would hardly
create any noticeable bias. Problem is, this approach would probably end up being horrendously
complex (if not impossible) to do properly.
> 2nd option (make sure only one operation is in process at one time for affected ranges):
> Add new messaging "channel" to be used to agree beforehand who is going to move. (1)
Before a node can start bootstrapping, it must ask permission from the node whose range it
is going to bootstrap to. The request will be accepted if no other node is currently bootstrapping
there. Nodes of course check their own token metadata before bootstrapping, but this does
not guard against two nodes bootstrapping simultaneously (that is, before they see each others'
gossip). The only node able to answer this reliably is bootstrap source. (2) For leaving,
the node should first check with all nodes that are going to have pending ranges if it is
OK to leave. If it receives "OK" from all of them, then it can leave. If bootstrapping or
leaving operation is rejected, the node will wait for random time and start over. Eventually
all nodes will be able to bootstrap.
> Good in this approach would be that with a relatively small change (adding one messaging
exchange before actual move) we could make sure that all parties involved agree that the operation
is OK. This is IMHO also the "cleanest" way. Downside is that we will need token lease times
(how long the node owns "rights" to the token) and timers to make sure that we do not end
in a deadlock (or lock ranges) in case the node will not complete the operation. Network partitions,
delays and clock skews might create very difficult border cases to handle.
> 3rd option (another approach to preventing conflicts from happening):
> (3) We might take a simplistic approach and add two new states: ABOUT_TO_BOOTSTRAP and
ABOUT_TO_LEAVE. Whenever a node wants to bootstrap or leave, it will first gossip this state
with token info and wait for some time. After the wait, it will check if it is OK to carry
on with the operation (that is, it has not received any bootstrap/leaving gossip that would
contradict with its own plans). Also in this case, if the operation cannot be done due to
conflict, the node waits for random period and start over.
> This would be easiest to implement, but it would not completely remove the problem as
network partitions and other delays might still cause two nodes to clash without knowledge
of each others' intentions. Also, adding two more gossip states will expand the state machine
and make it more fragile.
> Don't know if I'm thinking about this in a wrong way, but to me it seems resolving conflicts
is very difficult, so the only option is to avoid them by some mechinism that makes sure only
one node is moving within affected ranges.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message