cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] Updated: (CASSANDRA-580) vector clock support
Date Mon, 16 Aug 2010 16:41:32 GMT


Sylvain Lebresne updated CASSANDRA-580:

    Attachment: 0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch

I believe the attached patch (580-version-vector-wip.patch) has a problem: at
CL.ZERO and CL.ONE, writes that use version vectors are not replicated at all
(SP.updateDestinationByClock() clears destinationEndpoints and returns an
empty collection). This is far too unsafe.

This could certainly be fixed by adding a new WriteResponseHandler for those
cases. But I believe that there is a *much* better alternative.

This alternative consists of changing the write protocol (for version vectors
only, of course) to do the following (and note that the protocol of the current
patch is already different from the one for timestamps):
  # a node receives a write request (with a version vector clock) from a
    client. If it is a replica for the write, go to 3); otherwise go to 2)
  # the node delegates the write to one replica (along with the requested CL)
    and then only waits for an ack from this replica before answering the
    client (it doesn't replicate anything itself)
  # the chosen replica applies the mutation locally first (we must do it before
  # then it sends the mutation to the other replicas, waiting for as many
    responses as are necessary to achieve the requested consistency
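
To illustrate, the four steps above can be sketched as a toy, single-process
simulation in Python. This is not the attached patch: Node, required_acks,
coordinate_write and replicate_from_replica are hypothetical names, sends are
plain synchronous calls rather than messages, and the version vector itself
is left out. The assumed CL-to-ack mapping is ZERO=0, ONE=1, QUORUM=RF/2+1,
ALL=RF.

```python
class Node:
    """Toy stand-in for a Cassandra node (hypothetical, not from the patch)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}

    def apply(self, key, value):
        """Apply a mutation locally; return True as the ack, False if down."""
        if not self.up:
            return False
        self.store[key] = value
        return True


def required_acks(cl, replication_factor):
    """Assumed mapping from consistency level to the number of acks needed."""
    return {
        "ZERO": 0,
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[cl]


def replicate_from_replica(leader, replicas, key, value, cl):
    """Steps 3 and 4, run on the chosen replica."""
    # Step 3: apply locally first; the local write counts as one ack.
    acks = 1 if leader.apply(key, value) else 0
    # Step 4: send to the other replicas and count their acks.
    for peer in replicas:
        if peer is not leader and peer.apply(key, value):
            acks += 1
    return acks >= required_acks(cl, len(replicas))


def coordinate_write(node, replicas, key, value, cl):
    """Steps 1 and 2, run on the node that received the client request."""
    if any(node is r for r in replicas):
        leader = node          # step 1: we are a replica, go to step 3
    else:
        leader = replicas[0]   # step 2: delegate to one replica, with the CL
    # The receiving node only relays the chosen replica's answer.
    return replicate_from_replica(leader, replicas, key, value, cl)
```

In this sketch, with three replicas and one of them down, a QUORUM write
still succeeds while an ALL write does not, and a non-replica node that
receives the request never stores anything itself.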

To make this more concrete, I'm attaching a patch (0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch)
that implements this protocol (it all starts in SP.delegateMutateBlocking()).
Small disclaimer: this should work but is not really tested (so please be
nice :)). The function RowMutation.updateBeforeReplication() can safely be
ignored on a first read, but it would be needed if #1072 were to use this. The
patch could also probably be slightly optimized by allowing the
DelegatedRowMutationVerbHandler to handle multiple mutations at once. It
also only implements the protocol described above; #580 would have to be
rebased on top of it.

Anyway, I think this alternative is superior to the one used by the currently
attached #580 patch for the following reasons:
  * the protocol used by the current patch (write to one replica, wait for the
    ack and then replicate to the others, which differs from what I propose in
    that this is done from a potentially non-replica node) doesn't work for
    #1072 (because of a potential race condition with read repairs). The
    protocol I'm proposing does not suffer from this problem and (I'm quite
    convinced, let's hope I'm not wrong) would reconcile #1072 with the EC
    model of Cassandra. This is obviously the most important point.
  * it is slightly faster (network-latency-wise), as we don't wait for a full
    round-trip to a node before starting the replication.
  * it more cleanly separates the protocols of timestamped writes and versioned
    ones (without much code duplication really). I suppose whether this is
    better or not is more a matter of opinion, but at the very least it
    makes it clearer that version vectors don't slow down nor break the other

I'd be happy if someone could have a look at this and confirm that I'm not
completely wide of the mark. If I'm not, I may be able to spare some cycles
merging this idea with #580 (and #1072).

> vector clock support
> --------------------
>                 Key: CASSANDRA-580
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>         Environment: N/A
>            Reporter: Kelvin Kakugawa
>            Assignee: Kelvin Kakugawa
>             Fix For: 0.7.0
>         Attachments: 0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch,
> 580-1-Add-ColumnType-as-enum.patch, 580-context-v4.patch, 580-counts-wip1.patch, 580-thrift-v3.patch,
> 580-thrift-v6.patch, 580-version-vector-wip.patch
>   Original Estimate: 672h
>  Remaining Estimate: 672h
> Allow a ColumnFamily to be versioned via vector clocks, instead of long timestamps.
> Purpose: enable incr/decr; flexible conflict resolution.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
