On some RDD actions, I noticed ~500 tasks being executed. In the tasks details, most of the tasks were too small IMO and may be the task startup/shutdown/coordination overhead is coming into picture. The task durations are
Min : 5ms
25th %ile: 9ms
75th %ile: 13 ms
Max: 40 ms
In the RDDs, number of partitions are 428 for Many RDDs built on top of each other. The base RDD could benefit from large number of partitions but RDDs derived from it should have much less # of partitions.
How to control # of partitions @ RDD level ?