ignite-dev mailing list archives

From Ivan Rakov <ivan.glu...@gmail.com>
Subject Re: Metric showing how many nodes may safely leave the cluster
Date Fri, 04 Oct 2019 14:09:07 GMT

What if the user simply doesn't have a monitoring system configured?
Knowing whether the cluster will survive a node shutdown is critical for
any administrator who performs manipulations with the cluster topology.
Essential information should be easily accessible. We shouldn't force
users to configure external tools and write extra code for basic things.


Thanks, that's exactly the metric we need.
My point is that we should make it more accessible: via a control.sh
command and a single method for the whole cluster.

Best Regards,
Ivan Rakov

On 04.10.2019 16:34, Alex Plehanov wrote:
> Ivan, there already exists a metric,
> CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the
> current redundancy level for the cache group.
> We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data
> loss in this cache group.
> On Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glukos@gmail.com> wrote:
>> Igniters,
>> I've seen numerous requests for an easy way to check whether it is
>> safe to turn off a cluster node. As we know, in Ignite protection from
>> sudden node shutdown is implemented by keeping several backup
>> copies of each partition. However, this guarantee can be weakened for a
>> while if the cluster has recently experienced a node restart and
>> the rebalancing process is still in progress.
>> An example scenario is restarting nodes one by one in order to update a
>> local configuration parameter. The user restarts one node and rebalancing
>> starts: once it has completed, it will be safe to proceed (backup
>> count = 1). However, there's no transparent way to determine whether
>> rebalancing is over.
>> From my perspective, it would be very helpful to:
>> 1) Add information about rebalancing and the number of free-to-go nodes to
>> the ./control.sh --state command.
>> Examples of output:
>>> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
>>> Cluster tag: new_tag
>> --------------------------------------------------------------------------------
>>> Cluster is active
>>> All partitions are up-to-date.
>>> 3 node(s) can safely leave the cluster without partition loss.
>>> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
>>> Cluster tag: new_tag
>> --------------------------------------------------------------------------------
>>> Cluster is active
>>> Rebalancing is in progress.
>>> 1 node(s) can safely leave the cluster without partition loss.
>> 2) Provide the same information via ClusterMetrics. For example:
>> ClusterMetrics#isRebalanceInProgress // boolean
>> ClusterMetrics#getSafeToLeaveNodesCount // int
>> Here I need to mention that this information can be calculated from
>> existing rebalance metrics (see CacheMetrics#*rebalance*). However, I
>> still think we need a simpler and more understandable flag for whether
>> the cluster is in danger of data loss. Another point is that the current
>> metrics are bound to a specific cache, which makes this information even
>> harder to analyze.
>> Thoughts?
>> --
>> Best Regards,
>> Ivan Rakov
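[Editor's note: to make the proposal concrete, here is a hypothetical sketch of how the two proposed ClusterMetrics additions could drive a rolling-restart decision. Neither method existed in Ignite at the time of this message; the interface and helper below are illustrative only, mirroring the signatures proposed in the thread.]

```java
/**
 * Hypothetical sketch of the additions proposed above.
 * Not part of the real org.apache.ignite.cluster.ClusterMetrics interface.
 */
interface ProposedClusterMetrics {
    boolean isRebalanceInProgress();

    int getSafeToLeaveNodesCount();
}

class RollingRestartCheck {
    /**
     * A rolling-restart script would only proceed to stop the next node
     * when rebalancing is finished and at least one node can leave the
     * cluster without risking partition loss.
     */
    static boolean safeToRestartOneNode(ProposedClusterMetrics metrics) {
        return !metrics.isRebalanceInProgress()
            && metrics.getSafeToLeaveNodesCount() >= 1;
    }
}
```

[A script would poll this check between restarts, waiting while rebalancing moves partition copies back to the restarted node.]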
