spark-issues mailing list archives

From "Carson Wang (JIRA)" <>
Subject [jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)
Date Fri, 08 Jul 2016 06:15:11 GMT


Carson Wang commented on SPARK-16333:

This doesn't look related to SPARK-11206, since it involves the existing event SparkListenerTaskEnd.
The huge logs come from TaskInfo.accumulables. Is this related to the accumulator/metrics
changes in 2.0? From the logs, the metric "internal.metrics.updatedBlockStatuses" is very
verbose, as the RDD may have thousands of blocks.
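
As a quick way to confirm this on a live application, a small diagnostic listener could count the updatedBlockStatuses entries carried by each task. Below is a minimal Scala sketch; the AccumulableSizeListener name is made up for illustration and is not part of Spark:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical diagnostic listener (illustration only): reports how many of a
// task's accumulable updates are "internal.metrics.updatedBlockStatuses",
// to verify that this metric dominates the event volume.
class AccumulableSizeListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val accumulables = taskEnd.taskInfo.accumulables
    val blockStatusUpdates = accumulables.count(
      _.name.contains("internal.metrics.updatedBlockStatuses"))
    println(s"Task ${taskEnd.taskInfo.taskId}: ${accumulables.size} accumulables, " +
      s"$blockStatusUpdates updatedBlockStatuses entries")
  }
}

The listener can be registered with sparkContext.addSparkListener(new AccumulableSizeListener) or through the spark.extraListeners configuration.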

> Excessive Spark history event/json data size (5GB each)
> -------------------------------------------------------
>                 Key: SPARK-16333
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: seen on both x86 (Intel(R) Xeon(R) E5-2699) and ppc (Habanero, Model: 8348-21C) platforms, Red Hat Enterprise Linux Server release 7.2 (Maipo), Spark 2.0.0-preview (May-24, 2016 build)
>            Reporter: Peter Liu
>              Labels: performance, spark2.0.0
> With Spark 2.0.0-preview (May-24 build), the history event data (the JSON file) that is generated for each Spark application run (see below) can be as big as 5 GB, versus 14 MB for exactly the same application run and the same 1 TB of input data under Spark 1.6.1:
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-0000
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-0000
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-0000
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-0000
> The test was done with Sparkbench V2, SQL RDD (see github:

This message was sent by Atlassian JIRA

