spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Imran Rashid (JIRA)" <>
Subject [jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
Date Wed, 09 May 2018 19:15:00 GMT


Imran Rashid commented on SPARK-23206:


I think getting together to discuss the design is still a good idea, but I'd like to suggest
a couple of things that could push this forward a little bit and let things go in parallel
in the meantime.

1. [~elu] is it possible to post the entire change somewhere public?  even if its not in PR
quality, it might let others try the change out on their own workloads to get feedback about
things like how much the log size is changed, utility of metrics, etc.

2. [~cltlfcjin] are you planning on adding the executor memory breakdown you showed in a screenshot
above. eg. RSS, offheap, etc.?  We're very interested in this, but there are a lot of specifics
to work out.  We could also take over on that if you do not have time for that aspect  ...
though by "we" I really mean [~rezasafi] :).  Though I think we'll move discussing those specifics
into SPARK-21157 -- I'd like for this change to be more about the general framework and include
"easy" metrics and we leave SPARK-21157 to build on top of this.

> Additional Memory Tuning Metrics
> --------------------------------
>                 Key: SPARK-23206
>                 URL:
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Edwina Lu
>            Priority: Major
>         Attachments: ExecutorsTab.png, ExecutorsTab2.png, MemoryTuningMetricsDesignDoc.pdf,
SPARK-23206 Design Doc.pdf, StageTab.png
> At LinkedIn, we have multiple clusters, running thousands of Spark applications, and
these numbers are growing rapidly. We need to ensure that these Spark applications are well
tuned – cluster resources, including memory, should be used efficiently so that the cluster
can support running more applications concurrently, and applications should run quickly and
> Currently there is limited visibility into how much memory executors are using, and users
are guessing numbers for executor and driver memory sizing. These estimates are often much
larger than needed, leading to memory wastage. Examining the metrics for one cluster for a
month, the average percentage of used executor memory (max JVM used memory across executors
/  spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application
(number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple
memory regions (user memory, execution memory, storage memory, and overhead memory), and to
understand how memory is being used and fine-tune allocation between regions, it would be
useful to have information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and different memory
regions, the following additional memory metrics can be be tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
>  * Storage memory: memory used caching and propagating internal data across the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and also per
stage. This information can be shown in the Spark UI and the REST APIs. Information for peak
JVM used memory can help with determining appropriate values for spark.executor.memory and
spark.driver.memory, and information about the unified memory region can help with determining
appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory
information can help identify which stages are most memory intensive, and users can look into
the relevant code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, execution memory
and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics
for the executors, stages and Spark history log. Only interesting values (peak values per
stage per executor) are recorded in the Spark history log, to minimize the amount of additional
> We have attached our design documentation with this ticket and would like to receive
feedback from the community for this proposal.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message