ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Kovalenko (Jira)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-12255) Cache affinity fetching and calculation on client nodes may be broken in some cases
Date Thu, 03 Oct 2019 10:10:00 GMT
Pavel Kovalenko created IGNITE-12255:

             Summary: Cache affinity fetching and calculation on client nodes may be broken
in some cases
                 Key: IGNITE-12255
                 URL: https://issues.apache.org/jira/browse/IGNITE-12255
             Project: Ignite
          Issue Type: Bug
          Components: cache
    Affects Versions: 2.7, 2.5
            Reporter: Pavel Kovalenko
            Assignee: Pavel Kovalenko
             Fix For: 2.8

We have a cluster with server and client nodes.
We dynamically start several caches on a cluster.
Periodically we create and destroy some temporary cache in a cluster to move up cluster topology
At the same time, a random client node chooses a random existing cache and performs operations
on that cache.
It leads to an exception on client node that affinity is not initialized for a cache during
cache operation like:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

This exception means that the last affinity for a cache is calculated on version [8,2]. This
is a cache start version. It happens because during creating/destroying some temporary cache
we don’t re-calculate affinity for all existing but not already accessed caches on client
nodes. Re-calculate in this case is cheap - we just copy affinity assignment and increment
topology version.

As a solution, we need to fetch affinity on client node join for all caches. Also, we need
to re-calculate affinity for all affinity holders (not only for started caches or only configured
caches) for all topology events that happened in a cluster on a client node.

This solution showed the existing race between client node join and concurrent cache destroy.

The race is the following:

Client node (with some configured caches) joins to a cluster sending SingleMessage to coordinator
during client PME. This SingleMessage contains affinity fetch requests for all cluster caches.
When SingleMessage is in-flight server nodes finish client PME and also process and finish
cache destroy PME. When a cache is destroyed affinity for that cache is cleared. When SingleMessage
delivered to coordinator it doesn’t have affinity for a requested cache because the cache
is already destroyed. It leads to assertion error on the coordinator and unpredictable behavior
on the client node.

The race may be fixed with the following change:

If the coordinator doesn’t have an affinity for requested cache from the client node, it
doesn’t break PME with assertion error, just doesn’t send affinity for that cache to a
client node. When the client node receives FullMessage and sees that affinity for some requested
cache doesn’t exist, it just closes cache proxy for user interactions which throws CacheStopped
exception for every attempt to use that cache. This is safe behavior because cache destroy
event should be happened on the client node soon and destroy that cache completely.

This message was sent by Atlassian Jira

View raw message