ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Muzafarov <mmu...@apache.org>
Subject Re: Extended logging for rebalance performance analysis
Date Wed, 06 May 2020 18:01:30 GMT
Kirill,


Thank you for raising this topic. It's true that the rebalance process
still requires additional information for analyzing issues. Please,
don't think that I'm against your changes :-)

* My short answer. *

We won't do performance analysis on the production environment. Each
time we need performance analysis it will be done on a test
environment with verbose logging enabled. Thus I suggest moving these
changes to a separate `profiling` module and extend the logging much
more without any ышяу limitations. The same as these [2] [3]
activities do.

Let's keep the `core` module as simple as possible.
Let's design the right API for accessing rebalance internals for
profiling tools.
Can you, please, remove all changes from your PR [6] which are not
related to the proposed topic? (e.g. variable renamings).


* The long answer. *

Here are my thoughts on this. There are two different types of issues
in the rebalance process. The first case must be covered by daily
monitoring subsystems, the second case must be covered by additional
profiling tools.
1. errors during the rebalancing (e.g. rebalance not happens when required)
2. performance rebalancing issues (e.g. rebalance is slow)

Daily monitoring tools (JMX, Logging) are always turned on and
shouldn't require additional systems resources themselves. Since these
metrics must be lightweight any internal aggregation machinery is not
used on them. All these metrics are collected from each node
independently. Please, take a look at this issue [1] which covers most
of your questions mentioned above.

For all available metrics, we can configure LogExporterSpi, so they
will be available in logs.

> 1)How long was the balance(divided by supplier)?
rebalancingStartTime, rebalancingEndTime already exists for cache groups [4].
We can add the same for suppliers.

> 2)How many records and bytes per supplier were rebalanced?
We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
We will have rebalancingExpectedKeys [5].
We can add a new metric per cache keys, per supplier.

> 3)How many times did the rebalance restart?
rebalancingLastCancelledTime [4] metric exists.
Do we need to keep historical data on it?

> 4)Which partitions were rebalanced and from which nodes did they receive them?
Let's print this information to log prior to the rebalance process
starts. This will be helpful information and do not require a lot of
changes.

> 5)When did rebalance for all cache groups end?
This metric may be aggregated from rebalancingEndTime [4] by pulling
it from all nodes for all caches.


Performance rebalancing issues are related to profiling tools. They
may require additional system resources and definitely require a
dedicated environment for tests. We can't do performance analysis on
production environments due to performance impact.
I see some disadvantages of adding such tools to production code:
- verbose logging may affect performance.
- the problematic process may become even worse if an automatic
threshold suddenly turns on.
- new code changes will require additional efforts to keep logging up-to-date.


[1] https://issues.apache.org/jira/browse/IGNITE-12183
[2] https://issues.apache.org/jira/browse/IGNITE-12666
[3] https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
[4] https://issues.apache.org/jira/browse/IGNITE-12193
[5] https://issues.apache.org/jira/browse/IGNITE-12194
[6] https://github.com/apache/ignite/pull/7705/files

On Wed, 6 May 2020 at 19:50, Ivan Rakov <ivan.glukos@gmail.com> wrote:
>
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> > duration rebalance of cache group after which partitions distribution is
> > output, set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <tkalkirill@yandex.ru>
> wrote:
>
> > Hi, Igniters!
> >
> > I'd like to share a new small feature in AI [1].
> >
> > Current rebalance logging does not allow you to quickly answer following
> > questions:
> > 1)How long was the balance(divided by supplier)?
> > 2)How many records and bytes per supplier were rebalanced?
> > 3)How many times did rebalance restart?
> > 4)Which partitions were rebalanced and from which nodes did they receive
> > them?
> > 5)When did rebalance for all cache groups end?
> >
> > What you can see in logs now:
> >
> > 1)Starting rebalance with order of cache groups.
> > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
> >
> > 2)Start rebalance of cache group from a specific supplier, specifying
> > partition ids and mode - historical or full.
> > Starting rebalance routine [ignite-sys-cache,
> > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> > supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99],
> > histPartitions=[]]
> >
> > 3)Getting partial or complete partitions of cache group.
> > Completed rebalancing [grp=ignite-sys-cache,
> > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
> > Completed (final) rebalancing [grp=ignite-sys-cache,
> > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
> > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
> >
> > 4)End rebalance of cache group.
> > Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> > receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
> > lastCancelledTime=-1]
> >
> > Rebalance statistics:
> >
> > To speed up rebalance analysis, statistics will be output for each cache
> > group and total for all cache groups.
> > If duration rebalance for cache group is greater than threshold value,
> > partition distribution is output.
> > Statistics will you to analyze duration of the balance for each supplier
> > to understand which of them has been transmitting data for longest time.
> >
> > System properties are used to output statistics:
> >
> > IGNITE_QUIET - to output statistics, value must be false;
> > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> > duration rebalance of cache group after which partitions distribution is
> > output, set in milliseconds, default value is 10 minutes.
> >
> > Statistics examples:
> >
> > Successful full and historical rebalance of group cache, without
> > partitions distribution.
> > Rebalance information per cache group (successful rebalance): [id=3181548,
> > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> > bytes, d - duration, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> > [1=rebalancing.RebalanceStatisticsTest0]
> > Rebalance information per cache group (successful rebalance): [id=3181547,
> > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
> > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
> > bytes, d - duration, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >
> > Successful full and historical rebalance of group cache, with partitions
> > distribution.
> > Rebalance information per cache group (successful rebalance): [id=3181548,
> > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> > bytes, d - duration, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
> > than 5 ms, printing detailed information about partitions distribution
> > (threshold can be changed by setting number of milliseconds into
> > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
> > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr -
> > primary, bu - backup, su - supplier node, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> > [1=rebalancing.RebalanceStatisticsTest2]
> > [2=rebalancing.RebalanceStatisticsTest0]
> > Rebalance information per cache group (successful rebalance): [id=3181547,
> > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
> > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
> > bytes, d - duration, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] Rebalance
> > duration was greater than 100 ms, printing detailed information about
> > partitions distribution (threshold can be changed by setting number of
> > milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
> > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
> > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
> > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
> > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr -
> > primary, bu - backup, su - supplier node, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> > [1=rebalancing.RebalanceStatisticsTest2]
> > [2=rebalancing.RebalanceStatisticsTest1]
> >
> > Interrupted rebalance of group cache.
> > Rebalance information per cache group (interrupted rebalance):
> > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
> > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
> >
> > Total full and historical rebalance for all cache groups.
> > Rebalance total information (including successful and not rebalances):
> > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 10:55:18,780,
> > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, d=54 ms]
> > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions, e -
> > entries, b - bytes, d - duration, h - historical, nodeId mapping
> > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> > [1=rebalancing.RebalanceStatisticsTest0]
> > Rebalance total information (including successful and not rebalances):
> > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 15:01:44,116,
> > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445, d=294
> > ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h -
> > historical, nodeId mapping (nodeId=id,consistentId)
> > [0=rebalancing.RebalanceStatisticsTest0]
> >
> > [1] - https://issues.apache.org/jira/browse/IGNITE-12080

Mime
View raw message