ignite-dev mailing list archives

From "Moti Nisenson-Ken (Jira)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-12133) O(log n) partition exchange
Date Mon, 02 Sep 2019 07:57:00 GMT
Moti Nisenson-Ken created IGNITE-12133:

             Summary: O(log n) partition exchange
                 Key: IGNITE-12133
                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
             Project: Ignite
          Issue Type: Improvement
            Reporter: Moti Nisenson-Ken

Currently, partition exchange leverages a ring. This means that communication is O(n) in
the number of nodes. It also means that if non-coordinator nodes hang, it can take much longer
to successfully resolve the topology.

Instead, why not use something like a skip-list where the coordinator is first? The coordinator
can notify the first node at each level of the skip-list. Each node then notifies all of its
"near-neighbours" in the skip-list, where node B is a near-neighbour of node A if
max-level(B) <= max-level(A) and B is the first node at its level when traversing from A
in the direction of B, skipping over any nodes C which have max-level(C) > max-level(A).


1 . . . 3
1 . . . 3 . . . 5
1 . 2 . 3 . 4 . 5 . 6

In the above, 1 would notify 2 and 3, 3 would notify 4 and 5, 2 would notify 4, and 4 and 5
would each notify 6.
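
A minimal Python sketch of the near-neighbour rule, as one possible reading of the definition above. The function name `near_neighbours`, the `levels` mapping (node id -> max level, with 1 meaning bottom level only) and the `step` parameter are assumptions made for illustration; nothing here is existing Ignite code.

```python
# Max level of each node in the 6-node example: nodes 1 and 3 reach the top
# level, node 5 reaches the middle level, the rest are bottom-level only.
LEVELS = {1: 3, 2: 1, 3: 3, 4: 1, 5: 2, 6: 1}


def near_neighbours(levels, a, step=1):
    """Nodes that `a` notifies when walking the list in direction `step`.

    step=+1 walks away from the coordinator, step=-1 walks back towards it.
    """
    ids = sorted(levels)
    i = ids.index(a) + step
    covered = 0  # highest level for which a first node has already been found
    found = []
    while 0 <= i < len(ids) and covered < levels[a]:
        b = ids[i]
        # Nodes taller than `a` are skipped; a node counts only if it is the
        # first one seen at some level that is not covered yet.
        if covered < levels[b] <= levels[a]:
            found.append(b)
            covered = levels[b]
        i += step
    return found


for node in sorted(LEVELS):
    print(node, "->", near_neighbours(LEVELS, node))
```

Under this reading the loop reproduces the notifications listed for the example, and a node never notifies more nodes per direction than it has levels, because `covered` strictly increases with every node it picks.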

One can achieve better redundancy by having each node traverse in both directions, and having
the coordinator also notify the last node in the list at each level. This way, in the above
example, if 2 and 3 were both down, 4 would still get notified from 5 and 6 (in the backwards
direction).
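
The redundancy argument can be checked with a small simulation. This is only a sketch under assumptions: `propagate`, `last_per_level` and the `down` set are hypothetical names, failed nodes are modelled as simply not relaying, and the coordinator (the first node) is assumed to be alive.

```python
from collections import deque

LEVELS = {1: 3, 2: 1, 3: 3, 4: 1, 5: 2, 6: 1}  # node id -> max level


def near_neighbours(levels, a, step):
    """Near-neighbours of `a` in direction `step` (+1 forward, -1 backward)."""
    ids = sorted(levels)
    i, covered, found = ids.index(a) + step, 0, []
    while 0 <= i < len(ids) and covered < levels[a]:
        b = ids[i]
        if covered < levels[b] <= levels[a]:  # skip taller nodes
            found.append(b)
            covered = levels[b]
        i += step
    return found


def last_per_level(levels):
    """Last node in the list at each level, bottom level first."""
    top = max(levels.values())
    return [max(n for n in levels if levels[n] >= l) for l in range(1, top + 1)]


def propagate(levels, down, bidirectional):
    """Return (live nodes reached, who sent a message to whom)."""
    coordinator = min(levels)
    steps = (1, -1) if bidirectional else (1,)
    senders = {n: set() for n in levels}
    reached, queue = {coordinator}, deque([coordinator])
    while queue:
        a = queue.popleft()
        targets = [b for s in steps for b in near_neighbours(levels, a, s)]
        if a == coordinator and bidirectional:
            targets += last_per_level(levels)
        for b in targets:
            if b == a:
                continue
            senders[b].add(a)
            if b not in down and b not in reached:  # failed nodes never relay
                reached.add(b)
                queue.append(b)
    return reached, senders


reached, senders = propagate(LEVELS, down={2, 3}, bidirectional=True)
print(sorted(reached), sorted(senders[4]))
forward_only, _ = propagate(LEVELS, down={2, 3}, bidirectional=False)
print(sorted(forward_only))
```

With 2 and 3 down, the bidirectional run reaches 4, 5 and 6, and 4 hears from 5 and 6; the forward-only run reaches nobody beyond the coordinator.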


The idea is that each individual node has O(log n) nodes to notify, so the overall time is
reduced. Additionally, we can deal well with at least 1 node failure; if one includes the
option of processing backwards, 2 consecutive node failures can be handled as well. With
this kind of approach, the coordinator can basically treat any node it didn't receive
a message from as not connected and update the topology accordingly, disconnecting those
nodes. While there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected
node, then 2 disconnected nodes, in which case the connected node would be wrongly
ejected from the topology), these would generally be too rare to need explicit handling.
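
The edge case can be illustrated on the same 6-node example. The sketch below lists every node that would ever send a notification to a given node and flags live nodes whose senders are all down. The helper names `notifiers_of` and `stranded` are hypothetical, and the near-neighbour rule is the same assumed reading of the definition as before.

```python
LEVELS = {1: 3, 2: 1, 3: 3, 4: 1, 5: 2, 6: 1}  # node id -> max level


def near_neighbours(levels, a, step):
    """Near-neighbours of `a` in direction `step` (+1 forward, -1 backward)."""
    ids = sorted(levels)
    i, covered, found = ids.index(a) + step, 0, []
    while 0 <= i < len(ids) and covered < levels[a]:
        b = ids[i]
        if covered < levels[b] <= levels[a]:  # skip taller nodes
            found.append(b)
            covered = levels[b]
        i += step
    return found


def notifiers_of(levels, target):
    """All nodes that would send a notification to `target`."""
    coordinator = min(levels)
    senders = {a for a in levels for step in (1, -1)
               if target in near_neighbours(levels, a, step)}
    # The coordinator additionally notifies the last node at each level.
    top = max(levels.values())
    last = {max(n for n in levels if levels[n] >= l) for l in range(1, top + 1)}
    if target in last and target != coordinator:
        senders.add(coordinator)
    return senders


def stranded(levels, down):
    """Live non-coordinator nodes whose every notifier is down."""
    coordinator = min(levels)
    return [n for n in sorted(levels)
            if n != coordinator and n not in down
            and notifiers_of(levels, n) <= down]


print(sorted(notifiers_of(LEVELS, 4)))
print(stranded(LEVELS, down={2, 3, 5, 6}))
```

Node 4 is only ever notified by 2, 3, 5 and 6, so with all four of those down it never hears about the exchange and the coordinator would eject it even though it is connected.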

This message was sent by Atlassian Jira
