cassandra-commits mailing list archives

From "Brandon Williams (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3829) make seeds *only* be seeds, not special in gossip
Date Fri, 03 Feb 2012 20:31:55 GMT


Brandon Williams commented on CASSANDRA-3829:

bq. Instead of saying "seeds are absolutely only used on initial bootstrap", we make it "seeds
are also considered after every start-up, until at least a single gossip round has happened
successfully with the seed in question".
bq. This should retain, I think, the healing properties we have now with respect to nodes
re-starting after having been down during topology changes (but unfortunately retains the
requirement that a human keeps the seed list up to date at all times, and not just when adding

If a human still has to maintain the seed list, what does this buy us over keeping things
the way they are?
> make seeds *only* be seeds, not special in gossip 
> --------------------------------------------------
>                 Key: CASSANDRA-3829
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
> First, a little bit of "framing" on how seeds work:
> The concept of "seed hosts" makes fundamental sense; you need to
> "seed" a new node with some information required in order to join a
> cluster. The seed host list is the information Cassandra uses for this
> purpose.
> But seed hosts play a role even after the initial start-up of a new
> node in a ring. Specifically, seed hosts continue to be gossiped to
> separately by the Gossiper throughout the life of a node and the
> cluster.
> Generally, operators must be careful to ensure that all nodes in a
> cluster are appropriately configured to refer to an overlapping set of
> seed hosts. Strictly speaking this should not be necessary (see
> further down though), but is the general recommendation. An
> unfortunate side-effect of this is that whenever you are doing ring
> management, such as replacing nodes, removing nodes, etc, you have to
> keep in mind which nodes are seeds.
> For example, if you bring a new node into the cluster and it happens
> to be listed as a seed, then even if you do everything right with
> token assignment and auto_bootstrap=true, it will just enter the
> cluster without bootstrapping, causing inconsistent reads. This is
> dangerous.
> And worse - changing the notion of which nodes are seeds across a
> cluster requires a *rolling restart*. It can be argued that it should
> actually be okay for nodes other than the one being fiddled with to
> incorrectly treat the fiddled-with node as a seed node, but this fact
> is highly opaque to most users that are not intimately familiar with
> Cassandra internals.
> This adds additional complexity to operations, as it introduces a
> reason why you cannot view the ring as completely homogeneous, despite
> the fundamental idea of Cassandra that all nodes should be equal.
> Now, fast forward a bit to what we are doing over here to avoid this
> problem: We have a zookeeper-based system for keeping track of hosts
> in a cluster, which is used by our Cassandra client to discover nodes
> to talk to. This works well.
> In order to avoid the need to manually keep track of seeds, we wanted
> to make seeds automatically discoverable in order to eliminate them as
> an operational concern. We have implemented a seed provider that does
> this for us, based on the data we keep in zookeeper.
> We could see essentially three ways of plugging this in:
> * (1) We could simply rely on not needing overlapping seeds and grab
>   whatever we have when a node starts.
> * (2) We could do something like continually treat all other nodes as
>   seeds by dynamically changing the seed list (this involves some other
>   changes, like having the Gossiper update its notion of seeds).
> * (3) We could completely eliminate the use of seeds *except* for the
>   very specific purpose of initial start-up of an unbootstrapped node,
>   and keep using a static (for the duration of the node's uptime) seed
>   list.
> (3) was attractive because it felt like this was the original intent
> of seeds; that they be used for *seeding*, and not be constantly
> required during cluster operation once nodes are already joined.
> Now before I make the suggestion, let me explain how we are currently
> (though not yet in production) handling seeds and start-up.
> First, we have the following relevant cases to consider during a normal start-up:
> * (a) we are starting up a cluster for the very first time
> * (b) we are starting up a new clean node in order to join it to a pre-existing cluster
> * (c) we are starting up a pre-existing already joined node in a pre-existing cluster
> First, we proceeded on the assumption that we wanted to remove the use
> of seeds during regular gossip (other than on initial startup). This
> means that for the (c) case, we can *completely* ignore seeds. We
> never even have to discover the seed list, or if we do, we don't have
> to use them.
> This leaves (a) and (b). In both cases, the critical invariant we want
> to achieve is that we must have one or more *valid* seeds (valid means
> for (b) that the seed is in the cluster, and for (a) that it is one of
> the nodes that are part of the initial cluster setup).
> In the (c) case the problem is trivial - ignore seeds.
> In the (a) case, the algorithm is:
> * Register with zookeeper as a seed
> * Wait until we see *at least one* seed *other than ourselves* in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> In the (b) case, the algorithm is:
> * Wait until we see *at least one* seed in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> * Once fully up (around the time we listen to thrift), register as a seed in zookeeper
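
The three start-up cases above can be sketched roughly as follows. This is an illustrative Python model, not the seed provider described in the post: `SeedRegistry` is a hypothetical in-memory stand-in for the data kept in zookeeper, and `StartupCase`, `choose_seeds` and `finish_startup` are names invented for the example. Waiting is modelled by returning `None`, so the caller would retry until a seed list comes back.

```python
from enum import Enum
from typing import Optional, Set


class StartupCase(Enum):
    NEW_CLUSTER = "a"    # starting a cluster for the very first time
    NEW_NODE = "b"       # clean node joining a pre-existing cluster
    EXISTING_NODE = "c"  # already joined node restarting


class SeedRegistry:
    """Hypothetical in-memory stand-in for the seed data kept in zookeeper."""

    def __init__(self) -> None:
        self._seeds: Set[str] = set()

    def register(self, host: str) -> None:
        self._seeds.add(host)

    def seeds(self) -> Set[str]:
        return set(self._seeds)


def choose_seeds(case: StartupCase, me: str,
                 registry: SeedRegistry) -> Optional[Set[str]]:
    """Return the seed list to start with, or None if we must keep waiting."""
    if case is StartupCase.EXISTING_NODE:
        # (c): an already joined node ignores seeds entirely.
        return set()
    if case is StartupCase.NEW_CLUSTER:
        # (a): register first, then wait for a seed other than ourselves.
        registry.register(me)
    # (a) and (b): proceed only once at least one other seed is visible.
    others = registry.seeds() - {me}
    return others or None


def finish_startup(case: StartupCase, me: str, registry: SeedRegistry) -> None:
    """(b): a joining node registers as a seed only once it is fully up."""
    if case is StartupCase.NEW_NODE:
        registry.register(me)
```

The asymmetry is the point of the sketch: in case (a) a node advertises itself before it has seen anyone, while in case (b) it advertises itself only after it has joined, so a half-started node is never handed out as a seed.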
> With the annoyance that you have to explicitly let Cassandra know that
> "I am starting a cluster for the very first time from scratch", and
> ignoring the problem of single node clusters (just to avoid
> complicating this post further), this guarantees in both cases that
> all nodes eventually see each other.
> In the (a) case, all nodes except one are guaranteed to see the "one"
> node. The "one" node is guaranteed to see one of the others. Thus -
> convergence.
> In the (b) case, it's simple - the new node is guaranteed to see one
> or more nodes that are in the cluster - convergence.
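
The convergence argument for case (a) can be checked with a small graph model. The sketch below is an illustration only: it assumes the worst case in which every node proceeds with just the single seed the wait guarantees, and it treats "has as a seed" as an undirected link on the assumption that a gossip exchange informs both sides. The function names are invented for the example, and single-node clusters are left out, as in the post.

```python
from typing import Dict, List, Set


def initial_seed_views(registration_order: List[str]) -> Dict[str, Set[str]]:
    """Worst-case seed lists for case (a), one seed per node."""
    if len(registration_order) < 2:
        raise ValueError("single-node clusters are not modelled here")
    first = registration_order[0]
    # Every later node is guaranteed to see the first registrant.
    views = {node: {first} for node in registration_order[1:]}
    # The first registrant waits until it sees at least one other node.
    views[first] = {registration_order[1]}
    return views


def is_connected(views: Dict[str, Set[str]]) -> bool:
    """True if every node is reachable from every other via seed links."""
    adjacency: Dict[str, Set[str]] = {node: set() for node in views}
    for node, seeds in views.items():
        for seed in seeds:
            adjacency[node].add(seed)
            adjacency[seed].add(node)
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(adjacency)
```

Even with only one seed per node the resulting graph is a star around the first registrant, so it is connected; that is the "Thus - convergence" step in the text.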
> The current status is that we have implemented the seed provider and
> the start-up sequence works. But in order to simplify Cassandra (and
> to avoid having to diverge), we propose that we take this to its
> conclusion and officially make seeds only relevant on start-up, by
> only ever gossiping to seeds when in pre-bootstrap mode during
> start-up.
> The perceived benefits are:
> * Simplicity for the operator. All nodes are equal once joined; you can
>   almost forget completely about seeds.
> * No rolling restarts or potential for footshooting a node into a
>   cluster without bootstrap because it happened to be a seed.
> * Production clusters will suddenly start to actually *test* the gossip
>   protocol without relying on seeds. How sure are we that it even works,
>   and that phi conviction and RING_DELAY are appropriate, given that
>   practical clusters tend to gossip to a random seed (among very few)?
>   This change would make it so that we *always* gossip randomly to
>   anyone in the cluster, and there should be no danger that a cluster
>   happens to hold together because seeds are up, only to explode when
>   they are not.
> * It eliminates non-trivial concerns with automatic seed discovery,
>   particularly when you want that seed discovery to be rack and DC
>   aware. All you care about is what was described above; if that seed
>   happens to fail, we simply fail to find the cluster and can abort
>   start-up, and it can be retried. There is no need for "redundancy" in
>   seeds.
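
To make the "gossip randomly to anyone in the cluster" idea concrete, here is a toy simulation in which nodes only ever contact peers they already know about and seeds get no special treatment after start-up. It is a deliberately simplified model and not Cassandra's Gossiper: there is no failure detection (phi), no RING_DELAY, and no node churn, and `gossip_rounds` is a name invented for the example.

```python
import random
from typing import Dict, Optional, Set


def gossip_rounds(initial: Dict[str, Set[str]], rng: random.Random,
                  max_rounds: int = 1000) -> Optional[int]:
    """Rounds until every node knows every other node, or None if the
    limit is reached first."""
    # Work on a copy; every node trivially knows itself.
    known = {node: set(peers) | {node} for node, peers in initial.items()}
    everyone = set(known)
    for round_no in range(1, max_rounds + 1):
        for node in sorted(known):
            peers = sorted(known[node] - {node})
            if not peers:
                continue
            # Pick any known peer at random; no seed is preferred.
            peer = rng.choice(peers)
            # Both sides end the exchange with the union of their views.
            merged = known[node] | known[peer]
            known[node] = set(merged)
            known[peer] = set(merged)
        if all(view == everyone for view in known.values()):
            return round_no
    return None
```

Started from seed lists that form a connected graph, every node's view only grows, so the model converges; started from two disjoint islands, it never does and the function returns `None`. What the toy cannot answer is the question raised in the post, namely whether the real timing parameters are appropriate once seeds stop being gossiped to separately.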
> Thoughts? Are seeds important (by design) in some way other than for
> seeding? What do other people think about the implications of
> RING_DELAY etc?


