spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luca Canali <>
Subject RE: [DISCUSS][CORE] Exposing application status metrics via a source
Date Fri, 14 Sep 2018 15:47:50 GMT
Hi Stavros, All,

Interesting topic, I add here some thoughts and personal opinions on it: I find too the metrics
system quite useful for the use case of building Grafana dashboards as opposed to scraping
logs and/or using the Event Listener infrastructure, as you mentioned in your mail.
A few additional points in favour of Dropwizard metrics for me are:

-          Regarding the metrics defined on the ExecutorSource, I believe they have better
scalability compare to standard Task Metrics, as the Dropwizard metrics go directly from executors
to sink(s) rather than passing via the driver through the ListenerBus.

-          Another advantage that I see is that Dropwizard metrics make it easy to expose
information not available otherwise from the EveloLog/Listener events, such as executor.jvmCpuTime

I ’d like to add some feedback and random thoughts based on recent work on SPARK-25228 and
SPARK-22190, SPARK-25277, SPARK-25285:

-          the “Dropwizard metrics” space currently appears a bit “crowded”,  we could
probably profit from adding a few configuration parameters to turn some of the metrics on/off
as needed (I see that this point is also raised in the discussion in your PR 22381).

-          Another point is that the metrics instrumentation is a bit scattered around the
code, it would be nice to have a central point where the available metrics are exposed (maybe
just in the documentation).

-          Testing of new metrics seems to be a bit of a manual process at the moment (at
least it was for me) which could be improved. Related to that I believe that some recent work
on adding new metrics has ended up with a minor side effect/issue, details in SPARK-25277.

Best regards,

From: Stavros Kontopoulos <>
Sent: Wednesday, September 12, 2018 22:35
To: Dev <>
Subject: [DISCUSS][CORE] Exposing application status metrics via a source

Hi all,

I have a PR that exposes application status
metrics (related jira: SPARK-25394).

So far metrics tooling needs to scrape the metrics rest api to get metrics like job delay,
stages failed, stages completed etc.
From devops perspective it is good to standardize on a unified way of gathering metrics.
The need came up on the K8s side where jmx prometheus exporter is commonly used to scrape
metrics for several components such as kafka, cassandra, but the need is not limited there.

Check comment here<>:
"The rest api is great for UI and consolidated analytics, but monitoring through it is not
as straightforward as when the data emits directly from the source like this. There is all
kinds of nice context that we get when the data from this spark node is collected directly
from the node itself, and not proxied through another collector / reporter. It is easier to
build a monitoring data model across the cluster when node, jmx, pod, resource manifests,
and spark data all align by virtue of coming from the same collector. Building a similar view
of the cluster just from the rest api, as a comparison, is simply harder and quite challenging
to do in general purpose terms."

The PR is ok to be merged but the major concern here is the mirroring of the metrics. I think
that mirroring is ok since people may dont want to check the ui and they just want to integrate
with jmx only (my use case) and gather metrics in grafana (common case out there).

Does any of the committers or the community have an opinion on this?
Is there an agreement about moving on with this? Note that the addition does not change much
and can always be refactored if we come up with a new plan for the metrics story in the future.

View raw message