spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <>
Subject [jira] [Resolved] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
Date Tue, 08 Oct 2019 05:42:13 GMT


Hyukjin Kwon resolved SPARK-23607.
    Resolution: Incomplete

> Use HDFS extended attributes to store application summary to improve the Spark History
Server performance
> ---------------------------------------------------------------------------------------------------------
>                 Key: SPARK-23607
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 2.3.0
>            Reporter: Ye Zhou
>            Priority: Minor
>              Labels: bulk-closed
> Currently in the Spark History Server (SHS), the checkForLogs thread creates replaying tasks for
log files whose size has changed. The replaying task filters out most of the log file content
and keeps only the application summary: applicationId, user, attempt ACLs, start time, and end
time. The summary data is then written into listing.ldb and serves the application list on the
SHS home page. For a long-running application, the log file whose name ends with "inprogress"
gets replayed multiple times to extract this summary. This wastes compute and data-reading
resources on the SHS and delays the application's appearance on the home page. Internally we have
a patch that uses HDFS extended attributes to speed up fetching the application summary in the SHS.
With this patch, the Driver writes the application summary into extended attributes
as key/value pairs. The SHS first tries to read the extended attributes; if that read fails, it
falls back to reading the log file content as before. The feature can be enabled or disabled
through configuration.
> We have been running this patch internally for 4 months without issues, and the "last updated"
timestamp on the SHS stays within 1 minute, matching the configured 1-minute polling interval.
Previously the delay could be as long as 30 minutes at our scale, where a large
number of Spark applications run per day.
> We want to hear whether this kind of approach is also acceptable to the community. Please
comment. If so, I will post a pull request with the changes. Thanks.
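The write-then-read-with-fallback flow described above can be sketched as follows. This is a minimal illustration in Python against local-filesystem extended attributes, not the actual Scala patch or the HDFS API; the attribute name `user.spark.appSummary` and the helper `parse_summary_from_log` are hypothetical stand-ins.

```python
import json
import os


def write_app_summary(log_path, summary, xattr_name="user.spark.appSummary"):
    """Driver side (sketch): store the summary as a key/value extended
    attribute on the event-log file. Best-effort: readers fall back to
    log replay, so failures here are non-fatal."""
    setxattr = getattr(os, "setxattr", None)  # Linux-only in the stdlib
    if setxattr is not None:
        try:
            setxattr(log_path, xattr_name, json.dumps(summary).encode())
        except OSError:
            pass  # filesystem without xattr support


def read_app_summary(log_path, xattr_name="user.spark.appSummary"):
    """SHS side (sketch): fast path reads the extended attribute;
    on any failure, fall back to replaying the log content."""
    getxattr = getattr(os, "getxattr", None)
    if getxattr is not None:
        try:
            return json.loads(getxattr(log_path, xattr_name))
        except OSError:
            pass  # attribute missing or unsupported -> fall through
    return parse_summary_from_log(log_path)


def parse_summary_from_log(log_path):
    # Hypothetical stand-in for the existing replay logic: pretend the
    # first line of the log is a JSON event carrying the application id.
    with open(log_path) as f:
        event = json.loads(f.readline())
    return {"applicationId": event.get("App ID")}
```

On a file that carries the attribute, `read_app_summary` never touches the log content; on one that does not (or on a filesystem without xattr support), it transparently replays the log, mirroring the fallback behavior the issue describes.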

This message was sent by Atlassian Jira

