spark-user mailing list archives

From Anish Haldiya
Subject Re: Reduce number of partitions before saving to file. coalesce or repartition?
Date Fri, 14 Aug 2015 06:14:00 GMT

If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.

However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you would like (e.g. one node in the case of numPartitions = 1).
To avoid this, you can pass shuffle = true, which is what repartition
does. This adds a shuffle step, but it means the upstream partitions
are still computed in parallel (at whatever the current partitioning
is).
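
As a rough PySpark sketch of the two options above: the helper name
save_with_few_partitions and its parameters are made up for
illustration; only coalesce and saveAsTextFile are actual RDD methods.

```python
def save_with_few_partitions(rdd, path, num_partitions=3, shuffle=True):
    """Reduce an RDD to num_partitions and write it out as text.

    Hypothetical helper, not part of Spark. `rdd` is assumed to be a
    pyspark RDD and `path` an output directory that does not exist yet.
    """
    # shuffle=False: narrow coalesce, no shuffle, but the upstream work in
    #                this stage runs in only num_partitions tasks.
    # shuffle=True:  adds a shuffle, so the upstream partitions are still
    #                computed in parallel before being merged.
    reduced = rdd.coalesce(num_partitions, shuffle=shuffle)
    reduced.saveAsTextFile(path)
    return reduced
```

With shuffle=True the call behaves like rdd.repartition(num_partitions).
With shuffle=False the 5000 input partitions are folded into 3 tasks,
which avoids the shuffle but limits that stage to a parallelism of 3.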



On 8/14/15, Alexander Pivovarov wrote:
> Hi Everyone
> Which one should work faster (coalesce or repartition) if I need to reduce
> the number of partitions from 5000 to 3 before saving the RDD with
> saveAsTextFile?
> Total data size is about 400MB on disk in text format.
> Thank you
