spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SPARK-4630) Dynamically determine optimal number of partitions
Date Tue, 11 Oct 2016 03:52:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-4630.
------------------------------
    Resolution: Duplicate
      Assignee:     (was: Kostas Sakellis)

> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
>                 Key: SPARK-4630
>                 URL: https://issues.apache.org/jira/browse/SPARK-4630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark job. Partition
size and task count are inversely related: the larger the partitions, the fewer the tasks.
Spark performs best when the partitions processed by each task fall within a certain size range.
If partitions are too small, the user pays a disproportionate cost in scheduling overhead.
If partitions are too large, task execution slows down due to GC pressure and spilling to disk.
> To improve job performance, users often hand-tune the number (and therefore the size) of
partitions that the next stage receives. Factors that come into play are:
> - incoming partition sizes from the previous stage
> - number of available executors
> - available memory per executor (taking into account spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to size partitions automatically
for the user. This feature could be turned on or off with a configuration option.
> To make this happen, we propose modifying the DAGScheduler to take partition sizes into
account upon stage completion. Before scheduling the next stage, the scheduler can examine the
sizes of the partitions and determine the appropriate number of tasks to create. Since this change
requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached
before proceeding with the work.
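
For illustration only, the sizing decision described in the issue could be sketched roughly as follows. This is not DAGScheduler code and not the design the reporter had in mind (no design doc is included in this message). The names ClusterInfo and suggest_num_partitions, the 128 MiB target, the 16 MiB minimum, and the way the three listed factors are combined are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence

MIB = 1024 * 1024


@dataclass(frozen=True)
class ClusterInfo:
    """Hypothetical container for the inputs listed in the issue."""
    num_executors: int
    cores_per_executor: int
    executor_memory_bytes: int
    shuffle_memory_fraction: float  # e.g. the value of spark.shuffle.memoryFraction


def suggest_num_partitions(
    upstream_partition_bytes: Sequence[int],
    cluster: ClusterInfo,
    target_bytes: int = 128 * MIB,         # assumed upper bound on partition size
    min_partition_bytes: int = 16 * MIB,   # assumed floor to limit scheduling overhead
) -> int:
    """Suggest a task count for the next stage from observed upstream sizes."""
    total = sum(upstream_partition_bytes)
    if total == 0:
        return 1

    # Memory each concurrently running task can use for shuffle data,
    # assuming one task per core on every executor.
    per_task_budget = int(
        cluster.executor_memory_bytes
        * cluster.shuffle_memory_fraction
        / cluster.cores_per_executor
    )

    # Keep partitions small enough to fit the budget (avoids spilling),
    # but never larger than the target size.
    partition_bytes = max(1, min(target_bytes, per_task_budget))

    # Integer ceiling division: larger partitions mean fewer tasks.
    by_size = -(-total // partition_bytes)

    # If that leaves task slots idle, spread the data over all slots,
    # as long as the resulting partitions do not become too small.
    slots = cluster.num_executors * cluster.cores_per_executor
    if by_size < slots and total // slots >= min_partition_bytes:
        return slots
    return max(1, by_size)
```

With 4 executors of 2 cores and 4 GiB each and a fraction of 0.2, ten upstream partitions of 1 GiB yield 80 tasks of 128 MiB, while a 256 MiB input is spread over the 8 available slots instead of being packed into 2 tasks.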



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


