cassandra-pr mailing list archives

From aweisberg <...@git.apache.org>
Subject [GitHub] cassandra pull request #265: 14705
Date Mon, 10 Sep 2018 21:57:05 GMT
Github user aweisberg commented on a diff in the pull request:

    https://github.com/apache/cassandra/pull/265#discussion_r216486027
  
    --- Diff: src/java/org/apache/cassandra/locator/ReplicaLayout.java ---
    @@ -194,14 +198,97 @@ public Token token()
     
         public static ReplicaLayout.ForTokenWrite forTokenWrite(EndpointsForToken natural, EndpointsForToken pending)
         {
    -        if (Endpoints.haveConflicts(natural, pending))
    +        if (haveWriteConflicts(natural, pending))
             {
    -            natural = Endpoints.resolveConflictsInNatural(natural, pending);
    -            pending = Endpoints.resolveConflictsInPending(natural, pending);
    +            natural = resolveWriteConflictsInNatural(natural, pending);
    +            pending = resolveWriteConflictsInPending(natural, pending);
             }
             return new ReplicaLayout.ForTokenWrite(natural, pending);
         }
     
    +    /**
    +     * Detect if we have any endpoint in both pending and full; this can occur either due to races (there is no isolation)
    +     * or because an endpoint is transitioning between full and transient replication status.
    +     *
    +     * We essentially always prefer the full version for writes, because this is stricter.
    +     *
    +     * For transient->full transitions:
    +     *
    +     *   Since we always write to any pending transient replica, effectively upgrading it to full for the transition duration,
    +     *   it might at first seem to be OK to continue treating the conflict replica as its 'natural' transient form,
    +     *   as there is always a quorum of nodes receiving the write.  However, ring ownership changes are not atomic or
    +     *   consistent across the cluster, and it is possible for writers to see different ring states.
    +     *
    +     *   Furthermore, an operator would expect that the full node has received all writes, with no extra need for repair
    +     *   (as the normal contract dictates) when it completes its transition.
    +     *
    +     *   While we cannot completely eliminate risks due to ring inconsistencies, this approach is the most conservative
    +     *   one available to us today to mitigate them, and (we think) the easiest to reason about.
    +     *
    +     * For full->transient transitions:
    +     *
    +     *   In this case, things are dicier, because in theory we can trigger this change instantly.  All we need to do is
    +     *   drop some data, surely?
    +     *
    +     *   Ring movements can put us in a pickle; any other node could believe us to be full when we have become transient,
    +     *   and perform a full data request to us that we believe ourselves capable of answering, but that we are not.
    +     *   If the ring is inconsistent, it is even feasible that a transient request would be made to the node that is losing
    +     *   its transient status and does not yet know it has done so, resulting in all involved nodes being unaware
    +     *   of the data inconsistency.
    +     *
    +     *   This happens because ring ownership changes are implied by a single node; not all owning nodes get a say in when
    +     *   the transition takes effect.  As such, a node can hold an incorrect belief about its own ownership ranges.
    +     *
    +     *   This race condition is somewhat inherent in present-day Cassandra, and there's a limit to what we can do about it.
    +     *   It is a little more dangerous with transient replication, however, because we can completely answer a request without
    +     *   ever touching a digest, meaning we are less likely to attempt to repair any inconsistency.
    +     *
    +     *   We aren't guaranteed to contact any different nodes for the data requests, of course, though we at least have a chance.
    +     *
    +     * Note: If we have any pending transient->full movement, we need to move the full replica to our 'natural' bucket
    +     * to avoid corrupting our count.  This is fine for writes: all we're doing is ensuring we always write to the node,
    +     * instead of selectively.
    +     *
    +     * @param natural the 'natural' (unresolved) write replicas for the token
    +     * @param pending the pending write replicas for the token
    +     * @param <E> the concrete Endpoints collection type
    +     * @return true if any endpoint appears in both natural and pending
    +     */
    +    public static <E extends Endpoints<E>> boolean haveWriteConflicts(E natural, E pending)
    +    {
    +        Set<InetAddressAndPort> naturalEndpoints = natural.endpoints();
    +        for (InetAddressAndPort pendingEndpoint : pending.endpoints())
    +        {
    +            if (naturalEndpoints.contains(pendingEndpoint))
    +                return true;
    +        }
    +        return false;
    +    }
    +
    +    /**
    +     * MUST APPLY FIRST
    +     * See {@link ReplicaLayout#haveWriteConflicts}
    +     * @return a 'natural' replica collection that has had its conflicts with pending repaired
    +     */
    +    public static <E extends Endpoints<E>> E resolveWriteConflictsInNatural(E natural, E pending)
    +    {
    +        return natural.filter(r -> !r.isTransient() || !pending.contains(r.endpoint(), true));
    +    }
    +
    +    /**
    +     * MUST APPLY SECOND
    +     * See {@link ReplicaLayout#haveWriteConflicts}
    +     * @return a 'pending' replica collection that has had its conflicts with natural repaired
    +     */
    +    public static <E extends Endpoints<E>> E resolveWriteConflictsInPending(E natural, E pending)
    --- End diff --
    
    Do these need to be public?
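    
    For context, here is a minimal, self-contained sketch of the contract these methods express,
    using plain Java collections rather than the Endpoints/Replica types. SimpleReplica and the
    body of resolveConflictsInPending are illustrative assumptions (the diff above only shows the
    signature of resolveWriteConflictsInPending), so read it as a model of the apply-first /
    apply-second ordering, not as the actual implementation:
    
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;
    
    public class WriteConflictSketch
    {
        // A replica here is just an endpoint address plus whether it holds a full or transient copy.
        record SimpleReplica(String endpoint, boolean full) {}
    
        // Mirrors haveWriteConflicts in the diff: a conflict is any endpoint present in both collections.
        static boolean haveWriteConflicts(List<SimpleReplica> natural, List<SimpleReplica> pending)
        {
            Set<String> naturalEndpoints = natural.stream().map(SimpleReplica::endpoint).collect(Collectors.toSet());
            return pending.stream().anyMatch(p -> naturalEndpoints.contains(p.endpoint()));
        }
    
        // MUST APPLY FIRST: mirrors the natural.filter(...) shown in the diff; a transient natural
        // replica is dropped when the same endpoint is pending as full, so the stricter full version wins.
        static List<SimpleReplica> resolveConflictsInNatural(List<SimpleReplica> natural, List<SimpleReplica> pending)
        {
            return natural.stream()
                          .filter(r -> r.full() || pending.stream().noneMatch(p -> p.full() && p.endpoint().equals(r.endpoint())))
                          .collect(Collectors.toList());
        }
    
        // MUST APPLY SECOND: the diff only shows the signature, so this body is an assumption; it
        // drops any pending replica whose endpoint survived in the already-resolved natural set,
        // so each endpoint is counted exactly once for the write.
        static List<SimpleReplica> resolveConflictsInPending(List<SimpleReplica> natural, List<SimpleReplica> pending)
        {
            Set<String> naturalEndpoints = natural.stream().map(SimpleReplica::endpoint).collect(Collectors.toSet());
            return pending.stream()
                          .filter(p -> !naturalEndpoints.contains(p.endpoint()))
                          .collect(Collectors.toList());
        }
    
        public static void main(String[] args)
        {
            // 10.0.0.2 is mid transient->full transition: transient in 'natural', full in 'pending'.
            List<SimpleReplica> natural = List.of(new SimpleReplica("10.0.0.1", true),
                                                  new SimpleReplica("10.0.0.2", false));
            List<SimpleReplica> pending = List.of(new SimpleReplica("10.0.0.2", true));
    
            if (haveWriteConflicts(natural, pending))
            {
                // Same ordering as forTokenWrite above: resolve natural first, then pending against the result.
                List<SimpleReplica> resolvedNatural = resolveConflictsInNatural(natural, pending);
                List<SimpleReplica> resolvedPending = resolveConflictsInPending(resolvedNatural, pending);
                System.out.println("natural: " + resolvedNatural);  // keeps only 10.0.0.1 (full)
                System.out.println("pending: " + resolvedPending);  // keeps 10.0.0.2 (full), counted once
            }
        }
    }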


---

---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

