ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ткаленко кирилл <tkalkir...@yandex.ru>
Subject Extended logging for rebalance performance analysis
Date Mon, 04 May 2020 07:38:09 GMT
Hi, Igniters!

I'd like to share a new small feature in AI [1].

Current rebalance logging does not allow you to quickly answer following questions:
1)How long was the balance(divided by supplier)? 
2)How many records and bytes per supplier were rebalanced?
3)How many times did rebalance restart?
4)Which partitions were rebalanced and from which nodes did they receive them?
5)When did rebalance for all cache groups end?

What you can see in logs now:

1)Starting rebalance with order of cache groups.
Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], top=AffinityTopologyVersion [topVer=2,
minorTopVer=0], force=false, evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]

2)Start rebalance of cache group from a specific supplier, specifying partition ids and mode
- historical or full.
Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99], histPartitions=[]]

3)Getting partial or complete partitions of cache group.
Completed rebalancing [grp=ignite-sys-cache, supplier=8c525892-703b-4fc4-b28b-b2f139700000,
topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
Completed (final) rebalancing [grp=ignite-sys-cache, supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]

4)End rebalance of cache group.
Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=ignite-sys-cache],
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1, lastCancelledTime=-1]

Rebalance statistics:

To speed up rebalance analysis, statistics will be output for each cache group and total for
all cache groups. 
If duration rebalance for cache group is greater than threshold value, partition distribution
is output.
Statistics will you to analyze duration of the balance for each supplier to understand which
of them has been transmitting data for longest time.

System properties are used to output statistics:

IGNITE_QUIET - to output statistics, value must be false;
IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold duration rebalance of
cache group after which partitions distribution is output, set in milliseconds, default value
is 10 minutes.

Statistics examples:

Successful full and historical rebalance of group cache, without partitions distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13
10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
[nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b
- bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
[1=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13
15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
[nodeId=0, hp=10, he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - bytes,
d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]

Successful full and historical rebalance of group cache, with partitions distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13
10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
[nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b
- bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
[1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater than 5 ms, printing
detailed information about partitions distribution (threshold can be changed by setting number
of milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 = [0,bu,su],[1,bu],[2,pr,su]
1 = [0,bu,su],[1,bu],[2,pr,su] 2 = [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su]
4 = [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 = [0,bu,su],[1,bu],[2,pr,su]
7 = [0,bu,su],[1,bu],[2,pr,su] 8 = [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su]
Aliases: pr - primary, bu - backup, su - supplier node, h - historical, nodeId mapping (nodeId=id,consistentId)
[0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest2] [2=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13
15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
[nodeId=0, hp=10, he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - bytes,
d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
Rebalance duration was greater than 100 ms, printing detailed information about partitions
distribution (threshold can be changed by setting number of milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 = [0,pr,su],[1,bu],[2,bu] h
3 = [0,bu,su],[1,bu],[2,pr] h 4 = [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h
6 = [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 = [0,pr,su],[1,bu],[2,bu] h
9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr - primary, bu - backup, su - supplier node, h -
historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
[1=rebalancing.RebalanceStatisticsTest2] [2=rebalancing.RebalanceStatisticsTest1] 

Interrupted rebalance of group cache.
Rebalance information per cache group (interrupted rebalance): [id=644280849, name=default2,
startTime=2020-04-13 14:55:24,969, finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0] 

Total full and historical rebalance for all cache groups.
Rebalance total information (including successful and not rebalances): [startTime=2020-04-13
10:55:18,726, finishTime=2020-04-13 10:55:18,780, d=54 ms] Supplier statistics: [nodeId=0,
p=60, e=250, b=25000, d=54 ms] [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions,
e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId)
[0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0] 
Rebalance total information (including successful and not rebalances): [startTime=2020-04-13
15:01:43,822, finishTime=2020-04-13 15:01:44,116, d=294 ms] Supplier statistics: [nodeId=0,
hp=20, he=500, hb=50445, d=294 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration,
h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] 

[1] - https://issues.apache.org/jira/browse/IGNITE-12080
Mime
View raw message