Github user belliottsmith commented on a diff in the pull request:
https://github.com/apache/cassandra/pull/265#discussion_r216916456
--- Diff: src/java/org/apache/cassandra/locator/ReplicaLayout.java ---
@@ -194,14 +198,97 @@ public Token token()
public static ReplicaLayout.ForTokenWrite forTokenWrite(EndpointsForToken natural,
EndpointsForToken pending)
{
- if (Endpoints.haveConflicts(natural, pending))
+ if (haveWriteConflicts(natural, pending))
{
- natural = Endpoints.resolveConflictsInNatural(natural, pending);
- pending = Endpoints.resolveConflictsInPending(natural, pending);
+ natural = resolveWriteConflictsInNatural(natural, pending);
+ pending = resolveWriteConflictsInPending(natural, pending);
}
return new ReplicaLayout.ForTokenWrite(natural, pending);
}
+ /**
+ * Detect if we have any endpoint in both pending and full; this can occur either
due to races (there is no isolation)
+ * or because an endpoint is transitioning between full and transient replication
status.
+ *
+ * We essentially always prefer the full version for writes, because this is stricter.
+ *
+ * For transient->full transitions:
+ *
+ * Since we always write to any pending transient replica, effectively upgrading
it to full for the transition duration,
+ * it might at first seem to be OK to continue treating the conflict replica as
its 'natural' transient form,
+ * as there is always a quorum of nodes receiving the write. However, ring ownership
changes are not atomic or
+ * consistent across the cluster, and it is possible for writers to see different
ring states.
+ *
+ * Furthermore, an operator would expect that the full node has received all writes,
with no extra need for repair
+ * (as the normal contract dictates) when it completes its transition.
+ *
+ * While we cannot completely eliminate risks due to ring inconsistencies, this
approach is the most conservative
+ * available to us today to mitigate, and (we think) the easiest to reason about.
+ *
+ * For full->transient transitions:
+ *
+ * In this case, things are dicier, because in theory we can trigger this change
instantly. All we need to do is
+ * drop some data, surely?
+ *
+ * Ring movements can put is in a pickle; any other node could believe us to be
full when we have become transient,
+ * and perform a full data request to us that we believe ourselves capable of answering,
but that we are not.
+ * If the ring is inconsistent, it's even feasible that a transient request would
be made to the node that is losing
+ * its transient status, that also does not know it has yet done so, resulting
in all involved nodes being unaware
+ * of the data inconsistency.
+ *
+ * This happens because ring ownership changes are implied by a single node; not
all owning nodes get a say in when
+ * the transition takes effect. As such, a node can hold an incorrect belief about
its own ownership ranges.
+ *
+ * This race condition is somewhat inherent in present day Cassandra, and there's
actually a limit to what we can do about it.
+ * It is a little more dangerous with transient replication, however, because we
can completely answer a request without
+ * ever touching a digest, meaning we are less likely to attempt to repair any
inconsistency.
+ *
+ * We aren't guaranteed to contact any different nodes for the data requests, of
course, though we at least have a chance.
+ *
+ * Note: If we have any pending transient->full movement, we need to move the
full replica to our 'natural' bucket
+ * to avoid corrupting our count. This is fine for writes, all we're doing is ensuring
we always write to the node,
+ * instead of selectively.
+ *
+ * @param natural
+ * @param pending
+ * @param <E>
+ * @return
+ */
+ public static <E extends Endpoints<E>> boolean haveWriteConflicts(E natural,
E pending)
+ {
+ Set<InetAddressAndPort> naturalEndpoints = natural.endpoints();
+ for (InetAddressAndPort pendingEndpoint : pending.endpoints())
+ {
+ if (naturalEndpoints.contains(pendingEndpoint))
+ return true;
+ }
+ return false;
+ }
+
+ /**
+ * MUST APPLY FIRST
+ * See {@link ReplicaLayout#haveWriteConflicts}
+ * @return a 'natural' replica collection, that has had its conflicts with pending
repaired
+ */
+ public static <E extends Endpoints<E>> E resolveWriteConflictsInNatural(E
natural, E pending)
+ {
+ return natural.filter(r -> !r.isTransient() || !pending.contains(r.endpoint(),
true));
+ }
+
+ /**
+ * MUST APPLY SECOND
+ * See {@link ReplicaLayout#haveWriteConflicts}
+ * @return a 'pending' replica collection, that has had its conflicts with natural
repaired
+ */
+ public static <E extends Endpoints<E>> E resolveWriteConflictsInPending(E
natural, E pending)
--- End diff --
Not anymore
---
---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org
|