cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10413) Replaying materialized view updates from commitlog after node decommission crashes Cassandra
Date Wed, 07 Oct 2015 17:36:26 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947250#comment-14947250 ]

Joel Knighton commented on CASSANDRA-10413:
-------------------------------------------

cqlsh_tests.cqlsh_tests.TestCqlsh.test_pep8_compliance

https://github.com/apache/cassandra/commit/22099addaf6029656f8927ffb894c86c73bfaceb

I've pushed a branch at [t10413|https://github.com/jkni/cassandra/tree/t10413] that follows
Jake's suggestions. Until a node has scheduled gossip, it will force a write through the batchlog
directly in MV.

A dtest run is available [here|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-t10413-dtest/2/].
I've checked it against cassandra-3.0 dtest runs and there are no new failures; the PEP8 compliance
failure has been fixed since the last rebase.

A testall run is available [here|http://cassci.datastax.com/view/Dev/view/jkni/job/jkni-t10413-testall/3/].
There are two unit test failures that I can't find on cassandra-3.0 jobs: org.apache.cassandra.cql3.validation.entities.TypeTest.testNowToUUIDCompatibility
and org.apache.cassandra.cql3.validation.operations.SelectTest.testSingleClustering. They
both fail with a fairly cryptic "Forked Java VM exited abnormally" that suggests something
environmental. Neither test fails when run in a loop locally.

Ready for a look whenever [~tjake].

> Replaying materialized view updates from commitlog after node decommission crashes Cassandra
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10413
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10413
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Joel Knighton
>            Priority: Critical
>             Fix For: 3.0.0 rc2
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test, runnable as
> {code}
> lein with-profile +trunk test :only cassandra.mv-test/mv-crash-subset-decommission
> {code}
> This test crashes/restarts nodes while decommissioning nodes. These actions are not coordinated.
> In [10164|https://issues.apache.org/jira/browse/CASSANDRA-10164], we introduced a change
> to re-apply materialized view updates on commitlog replay.
> Some nodes, upon restart, will crash in commitlog replay. They throw the "Trying to get
> the view natural endpoint on a non-data replica" runtime exception in getViewNaturalEndpoint.
> I added logging to getViewNaturalEndpoint to show the results of replicationStrategy.getNaturalEndpoints
> for the baseToken and viewToken.
> It can be seen that these problems occur when the baseEndpoints and viewEndpoints are
> identical but do not contain the broadcast address of the local node.
> For example, a node at 10.0.0.5 crashes on replay of a write whose base token and view
> token replicas are both [10.0.0.2, 10.0.0.4, 10.0.0.6]. It seems we try to guard against this
> by considering pendingEndpoints for the viewToken, but this does not appear to be sufficient.
> I've attached the system.logs for a test run with added logging. In the attached logs,
> n1 is at 10.0.0.2, n2 is at 10.0.0.3, and so on. 10.0.0.6/n5 is the decommissioned node.
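
The failure condition described above can be illustrated with a small, self-contained sketch. This is not Cassandra's getViewNaturalEndpoint; it is a simplified stand-in that assumes a base replica is paired with the view replica at the same list position. The class name ViewEndpointSketch and the method pairedViewEndpoint are hypothetical, and the addresses are the ones from the example in the description.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of pairing a base replica with a view replica.
// Assumption: pairing is by list position; the real Cassandra logic is more involved.
public class ViewEndpointSketch {

    // Returns the view replica paired with the local node, or empty when the
    // local node is not a base replica (the case that crashes on replay).
    public static Optional<String> pairedViewEndpoint(String localAddress,
                                                      List<String> baseEndpoints,
                                                      List<String> viewEndpoints) {
        int index = baseEndpoints.indexOf(localAddress);
        if (index < 0 || index >= viewEndpoints.size()) {
            return Optional.empty();
        }
        return Optional.of(viewEndpoints.get(index));
    }

    public static void main(String[] args) {
        // Base and view replicas are identical, as in the reported case.
        List<String> replicas = List.of("10.0.0.2", "10.0.0.4", "10.0.0.6");

        // A node that is a base replica finds its paired view replica.
        System.out.println(pairedViewEndpoint("10.0.0.4", replicas, replicas));

        // 10.0.0.5 is not in either list, so there is nothing to pair with.
        System.out.println(pairedViewEndpoint("10.0.0.5", replicas, replicas));
    }
}
```

In this sketch the second call returns an empty Optional instead of throwing, which makes the missing-replica case explicit: a node replaying a commitlog entry for a token it no longer owns has no position in the base replica list.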



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
