flink-user mailing list archives

From mnxfst <mnx...@gmail.com>
Subject Re: Accumulators/Metrics
Date Thu, 12 Nov 2015 21:01:31 GMT
Hi Nick,

As Max mentioned in an earlier post on this topic, I started to work on a
service to collect metrics from running stream processing jobs. We want to
have all our metrics in one place, regardless of which application (type)
they come from.

To integrate that behavior, I started to look at the accumulator API and
learned from Max that all this information is collected for each task and
forwarded to the job manager. The job manager in turn exposes a network
interface to interact with it via Akka (see
org.apache.flink.runtime.messages.JobManagerMessages for details).

What I did was request the list of all running jobs and then fetch more
detailed information for each of them; the response includes the accumulator
values previously set.
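To give a rough idea of the job-manager side of this flow, here is a minimal
sketch in plain Java (illustrative names only, not Flink's actual classes):
each task reports its accumulator values, and the job-manager side merges
them into one job-wide view that a client can then query:

```java
/**
 * Illustrative sketch, not Flink's actual classes: merge the accumulator
 * snapshots reported by each task into one job-wide view, which is what a
 * client querying the job manager over Akka would receive.
 */
public class JobAccumulators {

    private final java.util.Map<String, Long> totals = new java.util.HashMap<>();

    /** Called once per task report; values for identical names are summed up. */
    public void report(String name, long value) {
        totals.merge(name, value, Long::sum);
    }

    /** Job-wide value, as a client querying the job manager would see it. */
    public long get(String name) {
        return totals.getOrDefault(name, 0L);
    }
}
```

In the real runtime the per-task values travel with the task's status updates;
the sketch only shows the merge step.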

As the API currently provides only simple value counters, a basic average
accumulator and a histogram (which I have not worked with yet), I started to
extend this to allow metrics similar to the gauges, meters, timers,
histograms and counters defined by the dropwizard metrics framework.
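For illustration, a hedged sketch of the simple metric types involved
(illustrative names; these are neither Flink's accumulator classes nor
dropwizard's implementations):

```java
/** Sketch of basic metric types in the spirit of dropwizard metrics. */
public class SimpleMetrics {

    /** Monotonic counter, like the simple value counters mentioned above. */
    public static class Counter {
        private long count;
        public void inc() { count++; }
        public void inc(long n) { count += n; }
        public long getCount() { return count; }
    }

    /** Running average, like the basic average accumulator mentioned above. */
    public static class Average {
        private long n;
        private double sum;
        public void add(double v) { n++; sum += v; }
        public double get() { return n == 0 ? 0.0 : sum / n; }
    }

    /** Gauge: simply reports the last observed value. */
    public static class Gauge {
        private double value;
        public void set(double v) { value = v; }
        public double get() { return value; }
    }
}
```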

Unfortunately, a direct integration with that framework turned out to be a
rather hack-oriented task. Therefore I decided to try a smarter approach,
which also keeps things simple on the Flink side.

If you know the Graphite application, you will know that it receives a
metric identifier, the current value and a timestamp as input. Everything
else is handled either by Graphite itself or by a statsd instance switched
in front of it.
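For reference, Graphite's plaintext protocol really is just one line per
sample, "path value epoch-seconds", sent over a plain socket. A minimal
formatter (an illustrative helper, not part of any framework):

```java
/** Renders one line of Graphite's plaintext protocol: "<path> <value> <epoch-seconds>\n". */
public class GraphiteLine {

    public static String format(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }
}
```

Sending such lines to Graphite's plaintext port is then just a matter of
writing them to a TCP socket.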

To avoid a dependency on such external tools, I am currently working on a
basic metrics implementation which provides the metric types mentioned
above. These are aggregated by the collector and may be forwarded to any
metrics system, e.g. Graphite.

The overall idea is to keep things very simple, as providing overly complex
metric types on the job side would lead to heavy network traffic when they
must be transferred over the network. Keep it simple and do the aggregation
on the collector side.
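A sketch of that division of labour (illustrative names; the real collector
is of course more involved): the job ships only raw counter snapshots, and
the collector derives the heavier metrics itself, here a rate from two
consecutive snapshots:

```java
/**
 * Collector-side aggregation sketch: the job sends only (count, timestamp)
 * snapshots; the collector derives an events-per-second rate from them.
 */
public class CollectorRate {

    private long lastCount = -1;      // -1 marks "no snapshot seen yet"
    private long lastTimestamp;

    /**
     * Feed one snapshot; returns events/second since the previous snapshot,
     * or 0.0 for the first snapshot (or non-advancing timestamps).
     */
    public double update(long count, long epochSeconds) {
        double rate = 0.0;
        if (lastCount >= 0 && epochSeconds > lastTimestamp) {
            rate = (double) (count - lastCount) / (epochSeconds - lastTimestamp);
        }
        lastCount = count;
        lastTimestamp = epochSeconds;
        return rate;
    }
}
```

This way the job side stays cheap (two longs per metric and interval), while
rates, averages and percentiles live entirely in the collector.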

Your objection regarding the network traffic towards the job manager is
valid and important. I haven't really thought about that so far, but maybe a
more distributed approach is needed to avoid a bottleneck here.

If you are interested in the solution, which will be used throughout the
jobs running in our environment: I hope it will be released as open source
soon, since the Otto Group believes in open source ;-) If you would like to
know more about it, feel free to ask ;-)

Best 
  Christian (Kreutzfeldt)


Nick Dimiduk wrote
> I'm much more interested in as-they-happen metrics than job-completion
> summaries, as these are stream processing jobs that should "never end".
> Ufuk's suggestion of a subtask-unique counter, combined with
> rate-of-change functions in a tool like InfluxDB, will probably work for
> my needs. So too does managing my own dropwizard MetricRegistry.
> 
> An observation: routing all online metrics through the heartbeat mechanism
> to a single host for display sounds like a scalability bottleneck. Doesn't
> this design limit the practical volume of metrics that can be exposed by
> the runtime and user applications?





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Accumulators-Metrics-tp3447p3459.html
