cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6648) Race condition during node bootstrapping
Date Tue, 04 Feb 2014 21:16:12 GMT


Brandon Williams commented on CASSANDRA-6648:

Thinking about this a bit more, I'm inclined to think that a) isFatClient should never have
checked epState.isAlive, since a fat client can be either alive or dead, and neither makes
it more or less of a fat client, and thus b) onAlive is the wrong event for MM (MigrationManager)
to be looking at to decide on pulling schema, since potentially *every* node actually IS a fat
client when first seen.  The true source of 'fatclientness' or not is TMD.isMember
(TokenMetadata), but SS (StorageService) hasn't processed the onJoin event yet when onAlive is
called.  We could possibly fix this by having isFatClient check for the presence of TOKENS,
which a fat client shouldn't have, or we could make SS.onJoin trigger
MM.maybeScheduleSchemaPull after it has processed the join event.
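
A minimal sketch of the first option, to make the idea concrete. Everything in it is a
simplified, hypothetical stand-in (FatClientSketch, EndpointState, AppState), not the actual
Gossiper code: it only shows a fat-client check that keys off the absence of TOKENS and
ignores liveness entirely.

```java
import java.util.EnumMap;
import java.util.Map;

class FatClientSketch {
    // Hypothetical stand-in for the gossiped application state keys.
    enum AppState { STATUS, TOKENS, SCHEMA }

    // Hypothetical, simplified endpoint state: a liveness flag plus gossiped states.
    static final class EndpointState {
        final boolean alive;
        final Map<AppState, String> states = new EnumMap<>(AppState.class);

        EndpointState(boolean alive) { this.alive = alive; }

        EndpointState with(AppState key, String value) {
            states.put(key, value);
            return this;
        }
    }

    // An endpoint counts as a fat client when it has gossiped no TOKENS.
    // Liveness is deliberately not consulted: alive or dead, it is still a fat client.
    static boolean isFatClient(EndpointState ep) {
        return ep != null && !ep.states.containsKey(AppState.TOKENS);
    }
}
```

With this shape, a node that has already gossiped TOKENS is never mistaken for a fat client
in onAlive, whether or not SS has processed its join yet.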

> Race condition during node bootstrapping
> ----------------------------------------
>                 Key: CASSANDRA-6648
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Critical
>         Attachments: 6648-v2.txt, CASSANDRA-6648.patch
> When bootstrapping a new node, data is "missing" as if the new node didn't actually bootstrap,
> which I tracked down to the following scenario:
> 1) New node joins the token ring and waits for the schema to settle before actually bootstrapping.
> 2) The schema check somehow passes and it starts bootstrapping.
> 3) Bootstrapping doesn't find the ks/cf that it should have received from the other node.
> 4) Queries at this point cause NPEs until they later "recover", but data is missing.
> The problem seems to be caused by a race condition between the migration manager and
> the bootstrapper, with the former running after the latter.
> I think this is supposed to protect against such scenarios:
> {noformat}
>             while (!MigrationManager.isReadyForBootstrap())
>             {
>                 setMode(Mode.JOINING, "waiting for schema information to complete", true);
>                 Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
>             }
> {noformat}
> But the MigrationManager.isReadyForBootstrap() implementation is quite fragile and doesn't
> take into account "slow" schema propagation.
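
For illustration only, one way a wait loop could tolerate slow schema propagation is to poll
for agreement between the local schema version and the versions reported by peers, with a
bounded timeout. This is a sketch under stated assumptions, not the fix proposed in this
ticket: SchemaWaitSketch, schemaAgrees, and awaitSchema are hypothetical names, and treating
"no peers seen yet" as not ready is an assumption of the sketch.

```java
import java.util.Collection;
import java.util.UUID;
import java.util.function.Supplier;

class SchemaWaitSketch {
    // True only when a local version exists and every known peer reports the same one.
    // Assumption: an empty peer list means "not ready yet".
    static boolean schemaAgrees(UUID local, Collection<UUID> peers) {
        if (local == null || peers.isEmpty())
            return false;
        for (UUID peer : peers)
            if (!local.equals(peer))
                return false;
        return true;
    }

    // Polls until the schema agrees or the timeout elapses; returns whether it agreed.
    static boolean awaitSchema(Supplier<UUID> local,
                               Supplier<Collection<UUID>> peers,
                               long timeoutMillis,
                               long pollMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (schemaAgrees(local.get(), peers.get()))
                return true;
            if (System.nanoTime() >= deadline)
                return false;
            Thread.sleep(pollMillis);
        }
    }
}
```

A caller would decide what to do on a false return (keep waiting, or abort the bootstrap)
rather than proceeding with a schema that may still be incomplete.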
