cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
Date Wed, 06 Apr 2016 14:20:25 GMT


Paulo Motta commented on CASSANDRA-8523:

There are two scenarios we should consider when replacing a node:
1) The replacing node has the same IP as the previous node
2) The replacing node has a different IP from the previous node

On CASSANDRA-9244 I have gotten pretty far in an [implementation|]
that adds a new non-dead gossip state {{BOOT_REPLACE}} and considers the replacing endpoint
as a bootstrapping pending endpoint, solving case 2 transparently.

Case 1 is trickier because when the replacing node enters gossip with a non-dead state, other
nodes will think the previous node is back up and send reads to it (since it is a natural endpoint).

A simple way to solve this is to special-case the read path and ignore nodes in "NON-NORMAL"
state when sending reads to natural endpoints. While this will probably solve the problem,
there are quite a few different paths we need to hack to make sure this is enforced correctly
(paxos, read, hints, etc), so I'm not totally comfortable with that.
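
To make the special-casing idea concrete, here is a minimal sketch of the filter the read path would need. The names {{NodeState}}, {{ReadEndpointFilter}} and {{readableEndpoints}} are hypothetical and are not Cassandra classes; endpoints are plain strings and gossip state is a simple map, purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; these are NOT Cassandra's real types.
enum NodeState { NORMAL, BOOT_REPLACE, LEAVING }

final class ReadEndpointFilter {
    /** Returns only the natural endpoints whose gossip state is NORMAL, in the original order. */
    static List<String> readableEndpoints(List<String> naturalEndpoints,
                                          Map<String, NodeState> gossipStates) {
        List<String> readable = new ArrayList<>();
        for (String endpoint : naturalEndpoints) {
            // Endpoints with no known state are skipped as well (assumption of this sketch).
            if (gossipStates.get(endpoint) == NodeState.NORMAL) {
                readable.add(endpoint);
            }
        }
        return readable;
    }
}
```

The same check would have to be repeated in every path that selects natural endpoints, which is the maintenance concern raised above.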

A more transparent but a bit costlier approach to solve case 1 would be to change the {{TokenMetadata}}
to keep nodes as {{(InetAddress, UUID)}} pairs, and create a new interface to the {{FailureDetector}}
indexed by {{UUID}}. This way we could keep {{(IP=,UUID=1)}} in {{TokenMetadata}}
as a natural endpoint, and add a replacement node {{(IP=,UUID=2)}} as a pending endpoint.
So, during reads, {{FD.isAlive(UUID=1)}} would return false, and natural reads would not be
sent to {{(IP=,UUID=1)}}, while pending writes would be sent to {{(IP=,UUID=2)}}
because {{FD.isAlive(UUID=2)}} would return true.
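
A rough sketch of the {{(InetAddress, UUID)}} idea, under stated assumptions: {{NodeId}}, {{UuidFailureDetector}} and {{ReplacementRouting}} are hypothetical names invented for illustration, not existing Cassandra types, and liveness is just a map rather than a real failure detector.

```java
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch; none of these types exist in Cassandra under these names.
final class NodeId {
    final InetAddress address;
    final UUID hostId;

    NodeId(InetAddress address, UUID hostId) {
        this.address = Objects.requireNonNull(address);
        this.hostId = Objects.requireNonNull(hostId);
    }

    // Two nodes sharing an IP are still distinct if their host IDs differ.
    @Override public boolean equals(Object o) {
        if (!(o instanceof NodeId)) return false;
        NodeId other = (NodeId) o;
        return address.equals(other.address) && hostId.equals(other.hostId);
    }

    @Override public int hashCode() {
        return Objects.hash(address, hostId);
    }
}

// Liveness keyed by host ID instead of by IP address.
final class UuidFailureDetector {
    private final Map<UUID, Boolean> alive = new HashMap<>();

    void report(UUID hostId, boolean isAlive) {
        alive.put(hostId, isAlive);
    }

    // Host IDs never reported are treated as dead (assumption of this sketch).
    boolean isAlive(UUID hostId) {
        return alive.getOrDefault(hostId, false);
    }
}

final class ReplacementRouting {
    // Natural endpoint: read from it only if its own host ID is alive.
    static boolean shouldRead(NodeId natural, UuidFailureDetector fd) {
        return fd.isAlive(natural.hostId);
    }

    // Pending endpoint: send writes to it only if its own host ID is alive.
    static boolean shouldWrite(NodeId pending, UuidFailureDetector fd) {
        return fd.isAlive(pending.hostId);
    }
}
```

With the old node registered as dead under UUID 1 and the replacement as alive under UUID 2, both on the same IP, {{shouldRead}} returns false for the natural endpoint while {{shouldWrite}} returns true for the pending one.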

I'd be happy to continue working on this, so feedback on any of the above or alternative approaches
would be greatly appreciated.

> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>                 Key: CASSANDRA-8523
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Brandon Williams
>             Fix For: 2.1.x
> In our operations, we make heavy use of replace_address (or replace_address_first_boot)
in order to replace broken nodes. We now realize that writes are not sent to the replacement
nodes while they are in hibernate state and streaming in data. This runs counter to what our
expectations were, especially since we know that writes ARE sent to nodes when they are bootstrapped
into the ring.
> It seems like cassandra should arrange to send writes to a node that is in the process
of replacing another node, just like it does for nodes that are bootstrapping. I hesitate
to phrase this as "we should send writes to a node in hibernate" because the concept of hibernate
may be useful in other contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period makes subsequent
repairs more expensive, proportional to the number of writes that we miss (and depending on
the amount of data that needs to be streamed during replacement and the time it may take to
rebuild secondary indexes, we could miss many hours' worth of writes). It also leaves
us more exposed to consistency violations.

This message was sent by Atlassian JIRA
