spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-26221) Improve Spark SQL instrumentation and metrics
Date Sat, 01 Dec 2018 04:32:00 GMT


Apache Spark reassigned SPARK-26221:

    Assignee: Reynold Xin  (was: Apache Spark)

> Improve Spark SQL instrumentation and metrics
> ---------------------------------------------
>                 Key: SPARK-26221
>                 URL:
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Major
> This is an umbrella ticket for various small improvements for better metrics and instrumentation.
Some thoughts:
> Differentiate query plan that’s writing data out, vs returning data to the driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the query plan
we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure overhead is
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client fetching result
(e.g. using a result iterator). Record the time blocked on client so we can remove it in measuring
query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a single
query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query execution
id in driver logs.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message