ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ткаленко кирилл <tkalkir...@yandex.ru>
Subject Re: Extended logging for rebalance performance analysis
Date Mon, 27 Jul 2020 08:51:11 GMT
Discussed in personal correspondence with Stas, decided to improve the message:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
	partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
	topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

into:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
	partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
	histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
	fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB
	topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - duration from first demand of message to output of message to log
bytesRcvd=5,9 KB - total number of bytes received in B,KB,MB,GB

avgSpeed= bytesRcvd/duration in KB/sec

histPartitions=1 - total number of partitions received by historical mode
histEntries=30 - total number of entries received by historical mode
histBytesRcvd=1 KB - total number of bytes received in B,KB,MB,GB by historical mode

fullPartitions=1 - total number of partitions received by full mode
fullEntries=30 - total number of entries received by full mode
fullBytesRcvd=3 KB - total number of bytes received in B,KB,MB,GB by full mode

27.07.2020, 11:50, "ткаленко кирилл" <tkalkirill@yandex.ru>:
> Discussed in personal correspondence with Stas, decided to improve the message:
> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
>     partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
>     topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>
> into:
> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
>     partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
>     histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
>     fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB
>     topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>
> Where:
> partitions=2 - total number of partitions received
> entries=60 - total number of entries received
> duration=8ms - duration from first demand of message to output of message to log
> bytesRcvd=5,9 KB - total number of bytes received in B,KB,MB,GB
>
> avgSpeed= bytesRcvd/duration in KB/sec
>
> histPartitions=1 - total number of partitions received by historical mode
> histEntries=30 - total number of entries received by historical mode
> histBytesRcvd=1 KB - total number of bytes received in B,KB,MB,GB by historical mode
>
> fullPartitions=1 - total number of partitions received by full mode
> fullEntries=30 - total number of entries received by full mode
> fullBytesRcvd=3 KB - total number of bytes received in B,KB,MB,GB by full mode
>
> 03.07.2020, 17:21, "ткаленко кирилл" <tkalkirill@yandex.ru>:
>> Sorry, forget.
>>
>> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" <tkalkirill@yandex.ru>:
>>>  Hi, Stan!
>>>
>>>  I don't understand you yet.
>>>
>>>  Now you can use metrics as it was done in the test [1]. Or can you tell me
where to do this, for example when completing rebalancing for all groups?
>>>
>>>  See what is now available and added in the logs:
>>>  1)Which group is rebalanced and which type of rebalance.
>>>  Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4,
minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[],
rebalanceId=1]
>>>
>>>  2) Completion of rebalancing from one of the suppliers.
>>>  Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4,
minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>>  3) Completion of the entire rebalance.
>>>  Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms,
bytesRcvd=40,4 KB]
>>>
>>>  These messages have a common parameter rebalanceId=1.
>>>
>>>  03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukyanov@gmail.com>:
>>>>>    On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkirill@yandex.ru>
wrote:
>>>>>
>>>>>    To calculate the average value, you can use the existing metrics
"RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft",
"RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>>
>>>>   You can calculate it, and I believe that this is the first thing anyone
would do when reading these logs and metrics.
>>>>   If that's an essential thing then maybe it should be available out of
the box?
>>>>
>>>>>    This also works correctly with the historical rebalance.
>>>>>    Now we can see rebalance type for each group and for each supplier
in logs. I don't think we should duplicate this information.
>>>>>
>>>>>    [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root]
Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0],
supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[],
rebalanceId=1]
>>>>
>>>>   I'm talking about adding info on how much data has been transferred during
rebalance.
>>>>   When rebalance completes I'd like to know how much data has been transferred,
was it historical or full, what was the average rebalance speed.
>>>>
>>>>   There are two reasons for having all that.
>>>>
>>>>   First, it helps to analyze the issues by searching the logs and looking
for anomalies.
>>>>
>>>>   Second, this allows to automate alerts: e.g. if you know your typical
historical rebalance speed, you can trigger an alert if it drops below that.
>>>>
>>>>>    03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukyanov@gmail.com>:
>>>>>>    Kirill,
>>>>>>
>>>>>>    I've looked through the patch.
>>>>>>    Looks good, but it feels like the first thing someone will
try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>>>>>>    Do you think it's reasonable to also add it to the logs? Maybe
even to the metrics?
>>>>>>
>>>>>>    Also, this works with historical rebalance, right? Can we specify
how much data was transferred via historical or full rebalance from each supplier?
>>>>>>    Maybe even estimate transfer speed in entries and bytes for
each rebalance type?
>>>>>>
>>>>>>    Thanks,
>>>>>>    Stan
>>>>>>
>>>>>>>     On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glukos@gmail.com>
wrote:
>>>>>>>
>>>>>>>     +1 to Alex G.
>>>>>>>
>>>>>>>     From my experience, the most interesting cases with Ignite
rebalancing
>>>>>>>     happen exactly in production. According to the fact that
we already have
>>>>>>>     detailed rebalancing logging, adding info about rebalance
performance looks
>>>>>>>     like a reasonable improvement. With new logs we'll be
able to detect and
>>>>>>>     investigate situations when rebalance is slow due to
uneven suppliers
>>>>>>>     distribution or network issues.
>>>>>>>     Option to disable the feature in runtime shouldn't be
used often, but it
>>>>>>>     will keep us on the safe side in case something goes
wrong.
>>>>>>>     The format described in
>>>>>>>     https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>>>     good to me.
>>>>>>>
>>>>>>>     On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл
<tkalkirill@yandex.ru>
>>>>>>>     wrote:
>>>>>>>
>>>>>>>>     Hello, Alexey!
>>>>>>>>
>>>>>>>>     Currently there is no way to disable / enable it,
but it seems that the
>>>>>>>>     logs will not be overloaded, since Alexei Scherbakov
offer seems reasonable
>>>>>>>>     and compact. Of course, you can add disabling / enabling
statistics
>>>>>>>>     collection via jmx for example.
>>>>>>>>
>>>>>>>>     23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncharuk@gmail.com>:
>>>>>>>>>     Hello Maxim, folks,
>>>>>>>>>
>>>>>>>>>     ср, 6 мая 2020 г. в 21:01, Maxim Muzafarov
<mmuzaf@apache.org>:
>>>>>>>>>
>>>>>>>>>>     We won't do performance analysis on the production
environment. Each
>>>>>>>>>>     time we need performance analysis it will
be done on a test
>>>>>>>>>>     environment with verbose logging enabled.
Thus I suggest moving these
>>>>>>>>>>     changes to a separate `profiling` module
and extend the logging much
>>>>>>>>>>     more without any ышяу limitations. The
same as these [2] [3]
>>>>>>>>>>     activities do.
>>>>>>>>>
>>>>>>>>>     I strongly disagree with this statement. I am
not sure who is meant here
>>>>>>>>>     by 'we', but I see a strong momentum in increasing
observability tooling
>>>>>>>>>     that helps people to understand what exactly
happens in the production
>>>>>>>>>     environment [1]. Not everybody can afford two
identical environments for
>>>>>>>>>     testing. We should make sure users have enough
information to understand
>>>>>>>>>     the root cause after the incident happened, and
not force them to
>>>>>>>>     reproduce
>>>>>>>>>     it, let alone make them add another module to
the classpath and restart
>>>>>>>>     the
>>>>>>>>>     nodes.
>>>>>>>>>     I think having this functionality in the core
module with the ability to
>>>>>>>>>     disable/enable it is the right approach. Having
the information printed
>>>>>>>>     to
>>>>>>>>>     log is ok, having it in an event that can be
sent to a monitoring/tracing
>>>>>>>>>     subsystem is even better.
>>>>>>>>>
>>>>>>>>>     Kirill, can we enable and disable this feature
in runtime to avoid the
>>>>>>>>     very
>>>>>>>>>     same nodes restart?
>>>>>>>>>
>>>>>>>>>     [1]
>>>>>>>>     https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Mime
View raw message