cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5351) Avoid repairing already-repaired data by default
Date Tue, 15 Oct 2013 12:28:52 GMT


Jason Brown commented on CASSANDRA-5351:

Interesting ideas here. However, here are some problems, off the top of my head, that need
to be addressed (in no particular order):

* Nodes that are fully replaced (see: Netflix running in the cloud). When a node is replaced,
we bootstrap it by streaming data from the closest peers (usually) in the local DC. The
new node would not have anti-compacted sstables, as it has never had a chance to repair. I'm
not sure whether bootstrapped data can be considered anti-compacted through commutativity; it
might be true, but I'd need to think about it more. Assuming not, when this new node is involved
in any repair, it would generate a different merkle tree than its already-repaired peers, and thus
all hell would break loose streaming already-repaired data to every node involved in the repair,
worse than today's repair (think streaming TBs of data across multiple Amazon datacenters).
If we can prove that the new node's data is commutatively repaired just by bootstrap, then
this is not a problem as such. Note this also affects move (to a lesser degree) and rebuild.
* Consider nodes A, B, and C. If nodes A and B successfully repair, but C fails to repair
with them (due to partitioning, app crash, etc.), then C is forced to do an -ipr repair,
as A and B have already anti-compacted and that is the only way C will be able to repair
against A and B.
* If the operator chooses to cancel the repair, we are left in an indeterminate state wrt which
nodes have successfully completed repairs with one another (similar to the last point).
* Local DC repair vs. global repair is largely incompatible with this. It looks like you get one
shot with each sstable's range for repair, so if you choose to do a local DC repair with an sstable,
you are forced to do -ipr if you later want to repair globally.
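
To make the first bullet concrete, here is a toy sketch (the class and method names are hypothetical, not Cassandra's actual repair code) of how a per-sstable repaired flag makes two replicas with identical data on disk disagree on the merkle-tree input, since incremental repair would hash only the unrepaired sstables:

```java
import java.util.List;

// Toy model: an sstable is just its data plus a repaired flag.
// Illustrative names only -- not Cassandra's real classes.
class SSTable {
    final String data;
    final boolean repaired;
    SSTable(String data, boolean repaired) { this.data = data; this.repaired = repaired; }
}

public class IncrementalRepairSketch {
    // Incremental repair would build its merkle tree from unrepaired sstables only.
    static int merkleInput(List<SSTable> tables) {
        int hash = 0;
        for (SSTable t : tables)
            if (!t.repaired)
                hash ^= t.data.hashCode();
        return hash;
    }

    public static void main(String[] args) {
        // Node A repaired its old data ("k1"), then wrote "k2".
        List<SSTable> nodeA = List.of(new SSTable("k1", true), new SSTable("k2", false));
        // Node B was bootstrapped from A: identical data on disk, but nothing
        // is marked repaired because B has never repaired and anti-compacted.
        List<SSTable> nodeB = List.of(new SSTable("k1", false), new SSTable("k2", false));

        // The incremental merkle-tree inputs differ, so "k1" would be
        // re-streamed even though both replicas already hold it.
        System.out.println(merkleInput(nodeA) == merkleInput(nodeB)); // prints "false"
    }
}
```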

Note that these problems are magnified immensely when you run in multiple datacenters, especially
datacenters separated by great distances.

While none of these situations is unresolvable, it seems there are many non-obvious ways
in which we can get into a non-deterministic state, where operators will either see tons of
data being streamed due to differing anti-compaction points, or will be forced
to run -ipr without an easily understood reason. I already see operators terminate repair
jobs because "they hang" or "take too long", for better or worse (mostly worse). At that point,
the operator is pretty much required to do an -ipr repair, which gets us back into the same
situation we are in today, but with more confusion and possibly using -ipr as the default.

It would probably be good to run -ipr as a best practice anyway every n days/weeks/months,
but I worry about the very non-obvious edge cases this introduces and the possibility that
operators will simply fall back to using -ipr whenever something goes bump or doesn't make sense.

Thanks for listening.

> Avoid repairing already-repaired data by default
> ------------------------------------------------
>                 Key: CASSANDRA-5351
>                 URL:
>             Project: Cassandra
>          Issue Type: Task
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Lyuben Todorov
>              Labels: repair
>             Fix For: 2.1
> Repair has always built its merkle tree from all the data in a columnfamily, which is
> guaranteed to work but is inefficient.
> We can improve this by remembering which sstables have already been successfully repaired,
> and only repairing sstables new since the last repair.  (This automatically makes CASSANDRA-3362
> much less of a problem too.)
> The tricky part is, compaction will (if not taught otherwise) mix repaired data together
> with non-repaired.  So we should segregate unrepaired sstables from the repaired ones.
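
The segregation described in the quoted issue can be sketched as a simple partitioning step when choosing compaction candidates (a hypothetical illustration, not Cassandra's actual compaction-strategy code): candidates are bucketed by their repaired flag so a single compaction task never merges repaired data with unrepaired data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of keeping repaired and unrepaired sstables in
// separate compaction buckets; not Cassandra's real compaction code.
class Table {
    final String name;
    final boolean repaired;
    Table(String name, boolean repaired) { this.name = name; this.repaired = repaired; }
}

public class SegregatedCompaction {
    // Split candidates so one compaction task never mixes repaired data
    // with data that still needs to be repaired.
    static Map<Boolean, List<Table>> buckets(List<Table> candidates) {
        Map<Boolean, List<Table>> out = new HashMap<>();
        out.put(true, new ArrayList<>());
        out.put(false, new ArrayList<>());
        for (Table t : candidates)
            out.get(t.repaired).add(t);
        return out;
    }

    public static void main(String[] args) {
        List<Table> candidates = List.of(
                new Table("sstable-1", true),
                new Table("sstable-2", false),
                new Table("sstable-3", true));
        Map<Boolean, List<Table>> b = buckets(candidates);
        System.out.println(b.get(true).size() + " repaired, "
                + b.get(false).size() + " unrepaired"); // prints "2 repaired, 1 unrepaired"
    }
}
```

With this split, only the unrepaired bucket would ever feed the incremental merkle tree, which is what lets repair skip sstables that were already repaired.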

This message was sent by Atlassian JIRA
