cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10231) Null status entries on nodes that crash during decommission of a different node
Date Fri, 09 Oct 2015 20:12:05 GMT


Joel Knighton commented on CASSANDRA-10231:

I think the force-blocking-flush approach is the least invasive and the most likely to
ensure correctness.

Using log entries, I've confirmed that the behavior I suspected does occur. Before commitlog replay,
we {{populateTokenMetadata}} for node1, node2, and node3. After commitlog replay, when we
{{populateTokenMetadata}} again, we only consider node2 and node3, so node1 stays present in the {{tokenMetadata}}.
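
To illustrate why node1 lingers, here is a minimal Java model. It is not Cassandra code: {{TokenMetadataSketch}} and its methods are hypothetical stand-ins, and the model assumes that population only adds endpoints and never removes ones that are missing from its input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the in-memory token metadata.
class TokenMetadataSketch {
    private final Set<String> endpoints = new TreeSet<>();

    // Assumption: population is additive. Endpoints absent from `peers`
    // are left untouched rather than removed.
    void populateTokenMetadata(List<String> peers) {
        endpoints.addAll(peers);
    }

    Set<String> endpoints() {
        return new TreeSet<>(endpoints);
    }

    // Mirrors the sequence described above: one population before commitlog
    // replay (three nodes), one after replay (two nodes).
    static Set<String> endpointsAfterRestart() {
        TokenMetadataSketch tm = new TokenMetadataSketch();
        tm.populateTokenMetadata(Arrays.asList("node1", "node2", "node3"));
        tm.populateTokenMetadata(Arrays.asList("node2", "node3"));
        return tm.endpoints();
    }

    public static void main(String[] args) {
        // node1 is still present because nothing ever removed it.
        System.out.println(endpointsAfterRestart()); // prints [node1, node2, node3]
    }
}
```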

I pushed a branch [10231-alternate|]
with a {{forceBlockingFlush}} only in {{removeEndpoint}}. I'll create a follow-up ticket to
further discuss the use of {{forceBlockingFlush}} for other {{PEERS}}-related methods in SystemKeyspace.
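
As a rough sketch of why a blocking flush in {{removeEndpoint}} helps, the model below assumes that a read performed before commitlog replay sees only flushed data, so an unflushed removal is invisible at that point. {{PeersFlushSketch}} and all of its members are hypothetical; this is not the patch on the branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a peers table with flushed and unflushed state.
class PeersFlushSketch {
    // State that a reader sees before commitlog replay.
    private final Set<String> flushed = new TreeSet<>();
    // Removals recorded only in the commitlog so far.
    private final List<String> unflushedRemovals = new ArrayList<>();

    void addPeer(String endpoint) {
        flushed.add(endpoint);
    }

    void removeEndpoint(String endpoint, boolean forceBlockingFlush) {
        unflushedRemovals.add(endpoint);
        if (forceBlockingFlush) {
            flush(); // make the removal visible to pre-replay readers
        }
    }

    void flush() {
        flushed.removeAll(unflushedRemovals);
        unflushedRemovals.clear();
    }

    Set<String> readBeforeReplay() {
        return new TreeSet<>(flushed);
    }

    // Peers visible at startup, before replay, after removing node1.
    static Set<String> peersSeenAtStartup(boolean forceBlockingFlush) {
        PeersFlushSketch peers = new PeersFlushSketch();
        for (String endpoint : Arrays.asList("node1", "node2", "node3")) {
            peers.addPeer(endpoint);
        }
        peers.removeEndpoint("node1", forceBlockingFlush);
        return peers.readBeforeReplay();
    }

    public static void main(String[] args) {
        System.out.println(peersSeenAtStartup(false)); // prints [node1, node2, node3]
        System.out.println(peersSeenAtStartup(true));  // prints [node2, node3]
    }
}
```

The trade-off in this model is that every removal pays for a synchronous flush, which is one reason to discuss the other {{PEERS}}-related methods separately.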

In CI, there are no unit test failures out of the ordinary.

In CI, there is only one dtest failure outside of historically flappy tests/tests with known
issues. This failure is in {{commitlog_test.TestCommitLog.stop_failure_policy_test}} and is reproducible
locally. In the original patch, upon commitlog failure, when gossip was shut down, we would
notify {{onChange}}, which in {{handleStateNormal}} would {{updateTokens}} for the local node,
which would call {{removeEndpoint}}, causing the thread to hang in {{forceBlockingFlush}}
(due to the aforementioned commitlog failure).
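
The hang can be pictured with a small Java model. It assumes, purely for illustration, that a flush is a future that is never completed once the commitlog has failed; {{BlockingFlushSketch}} and its methods are hypothetical, and a timeout is used here only so that the example terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of a blocking flush that depends on a healthy commitlog.
class BlockingFlushSketch {
    // Assumption: with a failed commitlog the flush future is never completed.
    static CompletableFuture<Void> flush(boolean commitLogFailed) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        if (!commitLogFailed) {
            future.complete(null);
        }
        return future;
    }

    // Returns true if the flush finished within the timeout, false if the
    // caller would still be blocked (an untimed wait would hang here).
    static boolean flushCompletesWithin(boolean commitLogFailed, long millis) {
        try {
            flush(commitLogFailed).get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(flushCompletesWithin(false, 100)); // prints true
        System.out.println(flushCompletesWithin(true, 100));  // prints false
    }
}
```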

Looking at git history, it seems this {{removeEndpoint}} is precautionary and there is currently
no gossip transition that results in the local node being present in {{PEERS}}. As a result,
I've removed this call from {{updateTokens}}, so the above commitlog test passes. This commit
has been pushed to the branch [10231-alternate|].

I'm waiting for CI to finish for this change; in the meantime, any feedback or review would
be great.

> Null status entries on nodes that crash during decommission of a different node
> -------------------------------------------------------------------------------
>                 Key: CASSANDRA-10231
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Joel Knighton
>             Fix For: 3.0.0 rc2
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
> This issue is reproducible through a Jepsen test of materialized views that crashes and
> decommissions nodes throughout the test.
> In a 5 node cluster, if a node crashes at a certain point (unknown) during the decommission
> of a different node, it may start with a null entry for the decommissioned node like so:
> DN ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon a restart
> of the affected node.
> This issue is further detailed in ticket [10068|].

