spark-user mailing list archives

From Jey Kottalam <...@cs.berkeley.edu>
Subject Re: Too many RDD partitions ???
Date Thu, 23 Jan 2014 21:55:47 GMT
You can use the RDD.coalesce method to adjust the partitioning of a
single RDD. Note that you'll need to set "shuffle=true" when
increasing the number of partitions.

See: http://spark.incubator.apache.org/docs/latest/api/core/org/apache/spark/rdd/RDD.html#coalesce(Int,Boolean):RDD[T]
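For example, here's a minimal sketch (assuming a SparkContext named
"sc" and a hypothetical input path "data.txt"):

    val base = sc.textFile("data.txt")             // suppose ~428 partitions

    // Decreasing the partition count: no shuffle needed; existing
    // partitions are merged in place.
    val fewer = base.coalesce(50)

    // Increasing the partition count: shuffle = true is required,
    // otherwise coalesce keeps the current partition count.
    val more = base.coalesce(1000, shuffle = true)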

On Thu, Jan 23, 2014 at 1:18 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
> Hi,
>
> On some RDD actions, I noticed ~500 tasks being executed. In the task
> details, most of the tasks look too small IMO, so the task
> startup/shutdown/coordination overhead may be dominating. The task
> durations are:
>
> Min : 5ms
> 25th %ile: 9ms
> Median: 10ms
> 75th %ile: 13ms
> Max: 40 ms
>
> In the RDDs, the number of partitions is 428 for many RDDs built on top of
> each other. The base RDD could benefit from a large number of partitions,
> but the RDDs derived from it should have far fewer partitions.
>
> How can I control the # of partitions at the RDD level?
>
>
