ignite-dev mailing list archives

From Ivan Rakov <ivan.glu...@gmail.com>
Subject Metric showing how many nodes may safely leave the cluster
Date Fri, 04 Oct 2019 13:09:44 GMT
Igniters,

I've seen numerous requests for an easy way to check whether it is safe
to turn off a cluster node. As we know, Ignite protects against sudden
node shutdown by keeping several backup copies of each partition.
However, this guarantee can be weakened for a while if the cluster has
recently experienced a node restart and rebalancing is still in
progress.
An example scenario is restarting nodes one by one in order to update a
local configuration parameter. The user restarts one node and
rebalancing starts: once it completes, it is safe to proceed (backup
count=1). However, there's no transparent way to determine whether
rebalancing is over.
From my perspective, it would be very helpful to:
1) Add information about rebalancing and the number of free-to-go nodes
to the output of the ./control.sh --state command.
Examples of output:

> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> Cluster tag: new_tag
> --------------------------------------------------------------------------------
> Cluster is active
> All partitions are up-to-date.
> 3 node(s) can safely leave the cluster without partition loss.

> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> Cluster tag: new_tag
> --------------------------------------------------------------------------------
> Cluster is active
> Rebalancing is in progress.
> 1 node(s) can safely leave the cluster without partition loss.
2) Provide the same information via ClusterMetrics. For example:
ClusterMetrics#isRebalanceInProgress // boolean
ClusterMetrics#getSafeToLeaveNodesCount // int

Here I need to mention that this information can be calculated from
existing rebalance metrics (see CacheMetrics#*rebalance*). However, I
still think that we need a simpler and more understandable flag
indicating whether the cluster is in danger of data loss. Another point
is that the current metrics are bound to a specific cache, which makes
this information even harder to analyze.
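
To make the intended semantics concrete, here is a minimal sketch of how
the two values could be derived. It is a toy model, not Ignite code: the
class SafeToLeaveSketch, the CopyState enum and the map of partition id
to copy states are all hypothetical names introduced for illustration.
The assumption is that a partition copy still being rebalanced does not
count as a usable copy, so the number of nodes that can leave is the
smallest number of up-to-date copies over all partitions, minus one.

```java
import java.util.List;
import java.util.Map;

/** Toy model only: none of these names exist in Ignite. */
public class SafeToLeaveSketch {
    /** Hypothetical copy states: OWNING is up-to-date, MOVING is still being rebalanced. */
    enum CopyState { OWNING, MOVING }

    /** True if at least one partition copy is still being rebalanced. */
    static boolean isRebalanceInProgress(Map<Integer, List<CopyState>> copiesByPartition) {
        return copiesByPartition.values().stream()
            .flatMap(List::stream)
            .anyMatch(state -> state == CopyState.MOVING);
    }

    /**
     * Number of nodes that can leave without losing any partition in the
     * worst case: (minimum count of OWNING copies over all partitions) - 1.
     */
    static int safeToLeaveNodesCount(Map<Integer, List<CopyState>> copiesByPartition) {
        int minOwning = Integer.MAX_VALUE;

        for (List<CopyState> copies : copiesByPartition.values()) {
            int owning = 0;

            for (CopyState state : copies) {
                if (state == CopyState.OWNING)
                    owning++;
            }

            minOwning = Math.min(minOwning, owning);
        }

        // No partition information at all: report 0 to stay conservative.
        if (minOwning == Integer.MAX_VALUE)
            return 0;

        // A partition with zero OWNING copies is already lost; never go negative.
        return Math.max(0, minOwning - 1);
    }
}
```

With backup count=1 and one copy of some partition still in the MOVING
state, this model reports that zero nodes can leave, which matches the
rolling restart scenario above: the next node should not be stopped
until rebalancing completes. A real implementation would have to
aggregate over all caches and cache groups, which is exactly what is
hard to do by hand with per-cache metrics.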

Thoughts?

-- 
Best Regards,
Ivan Rakov

