cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10231) Null status entries on nodes that crash during decommission of a different node
Date Fri, 09 Oct 2015 03:03:26 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949812#comment-14949812 ]

Stefania commented on CASSANDRA-10231:
--------------------------------------

I should have read this before commenting on CASSANDRA-10089; I left a note there to move
the discussion here.

I think you're correct: we'll end up with stale entries if we populate the token metadata
before recovering the commit log and some entries were previously deleted but not yet flushed.
So if we must populate the token metadata before commit log replay (cc [~krummas] regarding
CASSANDRA-6696), then we have no other choice but to force a blocking flush when we delete
entries in system {{PEERS}}. At this point I would suggest being consistent and forcing a blocking
flush in {{updateTokens}} as well. Note that this is already done in {{updatePreferredIP}},
so we are not introducing something totally new.
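To make the failure mode concrete, here is a toy model of a single peers table with a memtable, flushed sstable data and a commit log. All class and method names are hypothetical and do not mirror Cassandra's actual code. It assumes a crash loses only the memtable and that token metadata is loaded from the table before commit log replay, as described above. Under those assumptions an unflushed deletion leaves a stale peer entry behind, while a blocking flush on delete does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names are hypothetical and do not mirror Cassandra's real classes.
public class StalePeersDemo {

    // One system table (think system.peers): endpoint -> tokens.
    static final class ToyTable {
        final Map<String, String> sstable = new HashMap<>();  // flushed, durable data
        final Map<String, String> memtable = new HashMap<>(); // unflushed writes; null value = tombstone
        final List<String[]> commitLog = new ArrayList<>();   // unflushed mutations, replayed on restart

        void write(String endpoint, String tokens, boolean blockingFlush) {
            commitLog.add(new String[] {endpoint, tokens});
            memtable.put(endpoint, tokens);
            if (blockingFlush) {
                flush();
            }
        }

        void flush() {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                if (e.getValue() == null) {
                    sstable.remove(e.getKey()); // tombstone purges the flushed row
                } else {
                    sstable.put(e.getKey(), e.getValue());
                }
            }
            memtable.clear();
            commitLog.clear(); // flushed mutations no longer need replaying
        }

        // A crash loses the memtable; sstables and the commit log survive.
        void crash() {
            memtable.clear();
        }

        void replayCommitLog() {
            for (String[] m : commitLog) {
                memtable.put(m[0], m[1]);
            }
        }

        // Merged view: memtable entries (including tombstones) override sstable data.
        Map<String, String> read() {
            Map<String, String> view = new HashMap<>(sstable);
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                if (e.getValue() == null) {
                    view.remove(e.getKey());
                } else {
                    view.put(e.getKey(), e.getValue());
                }
            }
            return view;
        }
    }

    // Returns what token metadata would be populated with after a restart.
    static Map<String, String> restartAndLoad(boolean flushOnDelete) {
        ToyTable peers = new ToyTable();
        peers.write("10.0.0.5", "256 tokens", true);      // peer row is known and flushed
        peers.write("10.0.0.5", null, flushOnDelete);     // peer decommissioned: delete its row
        peers.crash();
        Map<String, String> tokenMetadata = peers.read(); // loaded BEFORE commit log replay
        peers.replayCommitLog();                          // too late: metadata already built
        return tokenMetadata;
    }

    public static void main(String[] args) {
        System.out.println("without flush on delete: " + restartAndLoad(false));
        System.out.println("with flush on delete:    " + restartAndLoad(true));
    }
}
```

Running it prints {{{10.0.0.5=256 tokens}}} for the first case and {{{}}} for the second: the deletion only protects the startup read once it has reached flushed data.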

In an ideal world, I'd say we should not rely on the content of any table (system or not)
before replaying the commit log, but if that is not possible I guess we have to be pragmatic.

Really well done on deducing this by the way!

> Null status entries on nodes that crash during decommission of a different node
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10231
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10231
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Stefania
>             Fix For: 3.0.0 rc2
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions nodes throughout the test.
> In a 5-node cluster, if a node crashes at a certain (unknown) point during the decommission of a different node, it may start with a null entry for the decommissioned node like so:
> DN 10.0.0.5 ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon a restart of the affected node.
> This issue is further detailed in ticket [10068|https://issues.apache.org/jira/browse/CASSANDRA-10068].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
