spark-user mailing list archives

From Manoj Samel <>
Subject Too many RDD partitions ???
Date Thu, 23 Jan 2014 21:18:14 GMT

On some RDD actions, I noticed about 500 tasks being executed. In the task
details, most of the tasks look too small to me, so the task
startup/shutdown/coordination overhead may be dominating the run time. The
task durations are:

Min: 5 ms
25th %ile: 9 ms
Median: 10 ms
75th %ile: 13 ms
Max: 40 ms

Many of the RDDs, which are built on top of each other, have 428 partitions.
The base RDD could benefit from a large number of partitions, but the RDDs
derived from it should have far fewer partitions.

How can I control the number of partitions at the RDD level?
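
To make the question concrete, here is a minimal sketch of what I have in
mind, not a confirmed answer. It assumes the RDD API's coalesce(numPartitions)
is the right call for merging partitions of a derived RDD. The helper names
suggest_partitions and shrink, and the 200 ms target task duration, are
hypothetical and only for illustration.

```python
import math


def suggest_partitions(current_partitions, median_task_ms, target_task_ms):
    """Rough estimate of a smaller partition count.

    Assumes total work is about current_partitions * median_task_ms and
    that it splits evenly; target_task_ms is an illustrative choice.
    """
    total_ms = current_partitions * median_task_ms
    wanted = math.ceil(total_ms / target_task_ms)
    # Never go below 1 partition or above the current count.
    return max(1, min(current_partitions, wanted))


def shrink(rdd, num_partitions):
    """Return an RDD with fewer partitions.

    Assumes rdd exposes coalesce(numPartitions), which merges existing
    partitions and by default avoids a full shuffle.
    """
    return rdd.coalesce(num_partitions)


# 428 partitions at ~10 ms each, aiming for ~200 ms per task -> 22
n = suggest_partitions(428, 10, 200)

# Intended use (hypothetical names; needs a running SparkContext):
#   derived = shrink(base_rdd.map(some_function), n)
```

The base RDD would keep its 428 partitions; only the derived RDDs would be
coalesced, if this is indeed the supported way to do it.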
