hadoop-common-issues mailing list archives

From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values
Date Wed, 01 Nov 2017 22:45:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234894#comment-16234894 ]

Erik Krogen commented on HADOOP-14989:
--------------------------------------

Thank you for the comments [~eyang]! You actually made me realize I had a bit of a misunderstanding
after digging into the code further. Let me try again:
* The problem I described is definitely an issue if you specify multiple refresh rates. I
agree there's not a great way around this issue but I think we should, at minimum, put something
in the documentation indicating that it is not a good idea. Right now the behavior I describe
when dealing with MutableRate values is not documented and would come as a surprise to an
operator.
* Specifying only a single refresh rate does not solve the JMX issue. The single-point collection
of metrics for all sinks occurs in {{MetricsSystemImpl}}, specifically {{sampleMetrics()}},
which then passes off the single {{MetricsBuffer}} to all sinks. This is great. However, JMX
avoids the {{MetricsSystemImpl}} code altogether, instead directly calling {{getMetrics()}}
on each {{MetricsSourceAdapter}}. Thus JMX cache refills can destroy metrics values even if
you correctly configure only one period. I have attached a patch, [^HADOOP-14989.test.patch],
which demonstrates this issue - it's hacky but it should get the point across.
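
To make the second point concrete without pulling in Hadoop, here is a minimal, self-contained sketch. It is not the attached test patch and does not use the real metrics2 classes: {{AvgSinceSnapshot}} is a hypothetical stand-in for the "average since last snapshot" behavior of {{MutableStat}}, and the extra {{snapshot()}} call stands in for a JMX cache refill that bypasses the shared {{MetricsBuffer}}.

```java
public class SnapshotInterferenceDemo {
    /** Hypothetical stand-in for a MutableStat-like metric: average since last snapshot. */
    static final class AvgSinceSnapshot {
        private double total;
        private long count;

        synchronized void add(double value) {
            total += value;
            count++;
        }

        /** Returns the average since the previous snapshot and clears the accumulators. */
        synchronized double snapshot() {
            double avg = count == 0 ? 0.0 : total / count;
            total = 0.0;
            count = 0;
            return avg;
        }
    }

    /**
     * Records one sample per simulated second for a 60 second sink period.
     * If jmxReadsBeforeSink is true, a second reader snapshots after second 59.
     */
    public static double sinkAverage(boolean jmxReadsBeforeSink) {
        AvgSinceSnapshot stat = new AvgSinceSnapshot();
        for (int second = 1; second <= 60; second++) {
            stat.add(second <= 59 ? 10.0 : 1000.0);
            if (jmxReadsBeforeSink && second == 59) {
                stat.snapshot(); // JMX-style read that bypasses the shared buffer
            }
        }
        return stat.snapshot(); // the value the periodic sink receives
    }

    public static void main(String[] args) {
        System.out.println("sink only:     " + sinkAverage(false));
        System.out.println("with JMX read: " + sinkAverage(true));
    }
}
```

With only the sink snapshotting, the sink sees the average over all 60 samples. With the extra read at second 59, the sink only sees the single sample recorded afterwards, even though just one sink period is configured.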

It seems to me the best way to fix this is to save the output values each time {{getMetrics()}}
is called and use those for the cache. We can either
* Call {{updateJmxCache()}} at the end of {{getMetrics()}} with the computed values
* Store the return value of {{getMetrics()}} and use it as the input for {{updateJmxCache()}}
the next time it is called, assuming that value is fresh enough.

The second is considerably more complex, but it avoids some of the potential performance penalty
of the {{updateAttrCache()}} and {{updateInfoCache()}} calls, which do create a bunch of objects.
I'm not sure the savings would be enough to be worth the extra complexity.
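
A rough sketch of the first option, under stated assumptions: {{Adapter}} below is a hypothetical class that only loosely mirrors the role of {{MetricsSourceAdapter}}, the metric name {{AvgTime}} is made up, and the real {{updateJmxCache()}} / {{updateAttrCache()}} / {{updateInfoCache()}} logic is reduced to copying a map. The idea it illustrates is that the sink path snapshots once and refreshes the JMX cache from those same values, while the JMX path only reads the cache.

```java
import java.util.HashMap;
import java.util.Map;

public class CachingAdapterSketch {
    /** Hypothetical metric: average since last snapshot (clears on snapshot). */
    static final class AvgSinceSnapshot {
        private double total;
        private long count;

        synchronized void add(double value) {
            total += value;
            count++;
        }

        synchronized double snapshot() {
            double avg = count == 0 ? 0.0 : total / count;
            total = 0.0;
            count = 0;
            return avg;
        }
    }

    /** Hypothetical adapter; not the real MetricsSourceAdapter API. */
    static final class Adapter {
        private final AvgSinceSnapshot stat = new AvgSinceSnapshot();
        private Map<String, Double> jmxCache = new HashMap<>();

        void record(double value) {
            stat.add(value);
        }

        /** Sink path: snapshot once, then refresh the JMX cache from the same values. */
        synchronized Map<String, Double> getMetrics() {
            Map<String, Double> values = new HashMap<>();
            values.put("AvgTime", stat.snapshot());
            jmxCache = new HashMap<>(values); // copy so callers cannot mutate the cache
            return values;
        }

        /** JMX path: serve cached values only; never snapshot the underlying metric. */
        synchronized Double getAttribute(String name) {
            return jmxCache.get(name);
        }
    }

    /** Returns {value seen by the sink, value seen by JMX afterwards}. */
    public static double[] run() {
        Adapter adapter = new Adapter();
        for (int i = 0; i < 59; i++) {
            adapter.record(10.0);
        }
        adapter.getAttribute("AvgTime"); // JMX read mid-period: cache is empty, nothing is cleared
        adapter.record(1000.0);
        double sinkValue = adapter.getMetrics().get("AvgTime");
        double jmxValue = adapter.getAttribute("AvgTime");
        return new double[] {sinkValue, jmxValue};
    }

    public static void main(String[] args) {
        double[] result = run();
        System.out.println("sink=" + result[0] + " jmx=" + result[1]);
    }
}
```

In this sketch the JMX read in the middle of the period no longer disturbs the sink's average, and JMX afterwards reports the same value the sink received. The trade-off, at least in this simplified form, is that JMX values are only as fresh as the most recent {{getMetrics()}} call.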

While digging / testing I also noticed another bug which occurs if you have multiple sink
periods set; see HADOOP-15008.

> Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-14989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14989
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 2.6.5
>            Reporter: Erik Krogen
>            Priority: Critical
>         Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the way {{MutableStat}}
values are collected (and thus {{MutableRate}}, since it is based off of {{MutableStat}})
means that each sink configured (including JMX) only receives a portion of the average information.
> {{MutableStat}}, to compute its average value, maintains a total value since last snapshot,
as well as operation count since last snapshot. Upon snapshotting, the average is calculated
as (total / opCount) and placed into a gauge metric, and total / operation count are cleared.
So the average value represents the average since the last snapshot. If only a single sink
ever snapshots, this would result in the expected behavior that the value is the average over
the reporting period. However, each additional configured sink, and each JMX cache refresh,
performs another snapshot operation. So, for example, if you have a FileSink configured at
a 60 second interval and your JMX cache refreshes itself 1 second before the FileSink period
fires, the values emitted to your FileSink only represent averages _over the last one second_.
> A few ways to solve this issue:
> * From an operator perspective, ensure only one sink is configured. This is not realistic
given that the JMX cache exhibits the same behavior.
> * Make {{MutableRate}} manage its own average refresh, similar to {{MutableQuantiles}},
which has a refresh thread and saves a snapshot of the last quantile values that it will serve
up until the next refresh. Given how many {{MutableRate}} metrics there are, a thread per
metric is not really feasible, but could be done on e.g. a per-source basis. This has some
downsides: if multiple sinks are configured with different periods, what is the right refresh
period for the {{MutableRate}}? 
> * Make {{MutableRate}} emit two counters, one for total and one for operation count,
rather than an average gauge and an operation count counter. The average could then be calculated
downstream from this information. This is cumbersome for operators and not backwards compatible.
To improve on both of those downsides, we could have it keep the current behavior but _additionally_
emit the total as a counter. The snapshotted average is probably sufficient in the common
case (we've been using it for years), and when more guaranteed accuracy is required, the average
could be derived from the total and operation count.
> Open to suggestions & input here.
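
The last option in the description above (emit the running total as a counter alongside the operation count) can be illustrated with a small sketch. Everything here is hypothetical and independent of the real {{MutableRate}}: {{TotalAndCount}} stands in for a metric that exposes two monotonically increasing counters, and {{Reader}} stands in for any downstream consumer that derives the average from the difference between two readings.

```java
public class CounterDerivedAverage {
    /** Immutable pair of counter values taken at one point in time. */
    static final class Reading {
        final double total;
        final long count;

        Reading(double total, long count) {
            this.total = total;
            this.count = count;
        }
    }

    /** Hypothetical metric: counters only ever increase; reading never clears them. */
    static final class TotalAndCount {
        private double total;
        private long count;

        synchronized void add(double value) {
            total += value;
            count++;
        }

        synchronized Reading read() {
            return new Reading(total, count);
        }
    }

    /** Downstream reader: keeps its own previous reading, so readers do not interfere. */
    static final class Reader {
        private Reading previous = new Reading(0.0, 0);

        double averageSinceLastRead(TotalAndCount metric) {
            Reading now = metric.read();
            long ops = now.count - previous.count;
            double avg = ops == 0 ? 0.0 : (now.total - previous.total) / ops;
            previous = now;
            return avg;
        }
    }

    /** Returns {first JMX-style read, sink read over the full period, second JMX-style read}. */
    public static double[] run() {
        TotalAndCount metric = new TotalAndCount();
        Reader jmxReader = new Reader();
        Reader sinkReader = new Reader();
        for (int i = 0; i < 59; i++) {
            metric.add(10.0);
        }
        double jmxFirst = jmxReader.averageSinceLastRead(metric);
        metric.add(1000.0);
        double sinkAvg = sinkReader.averageSinceLastRead(metric);
        double jmxSecond = jmxReader.averageSinceLastRead(metric);
        return new double[] {jmxFirst, sinkAvg, jmxSecond};
    }

    public static void main(String[] args) {
        double[] r = run();
        System.out.println("jmx#1=" + r[0] + " sink=" + r[1] + " jmx#2=" + r[2]);
    }
}
```

Because reading the counters does not reset them, each consumer gets a correct average over its own interval regardless of how many other consumers exist or when they read.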



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

