spark-user mailing list archives

From Anish Haldiya <an...@sigmoidanalytics.com>
Subject Re: Reduce number of partitions before saving to file. coalesce or repartition?
Date Fri, 14 Aug 2015 06:14:00 GMT
Hi,

If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.

However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you'd like (e.g. one node in the case of numPartitions = 1). To
avoid this, you can pass shuffle = true, which is what repartition
does internally. This adds a shuffle step, but it means the upstream
partitions are still computed in parallel, at whatever the current
partitioning is.
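
To make the trade-off concrete, here is a rough toy model in plain
Python. It does not call Spark at all; the function names
coalesce_groups and upstream_tasks are made up for illustration, and
the contiguous grouping is a simplification that ignores data
locality. In Spark itself the three options would be rdd.coalesce(3),
rdd.coalesce(3, shuffle = true) and rdd.repartition(3).

```python
# Toy model only: hypothetical helpers, not part of any Spark API.

def coalesce_groups(num_parents, num_targets):
    """Assign parent partition indices to output partitions in
    contiguous ranges, as a coalesce without shuffle roughly does
    when there is no locality information (simplifying assumption)."""
    # Without a shuffle, coalesce cannot increase the partition count.
    n = min(num_parents, num_targets)
    return [
        list(range(i * num_parents // n, (i + 1) * num_parents // n))
        for i in range(n)
    ]


def upstream_tasks(num_parents, num_targets, shuffle):
    """Number of tasks that can run the upstream computation in parallel."""
    if shuffle:
        # The shuffle is a stage boundary: upstream keeps its own partitioning.
        return num_parents
    # No shuffle: each output task computes all of its parent partitions.
    return min(num_parents, num_targets)


if __name__ == "__main__":
    groups = coalesce_groups(5000, 3)
    print("parents per output partition:", [len(g) for g in groups])
    print("upstream tasks, shuffle=False:", upstream_tasks(5000, 3, False))
    print("upstream tasks, shuffle=True: ", upstream_tasks(5000, 3, True))
```

Under this model, going from 5000 partitions to 3 without a shuffle
leaves only 3 tasks to do all of the upstream work, while with
shuffle = true the upstream work still runs as 5000 tasks and only
the final write uses 3. Which one ends up faster for you depends on
how expensive the upstream computation is compared to the shuffle.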

Regards,

anish

On 8/14/15, Alexander Pivovarov <apivovarov@gmail.com> wrote:
> Hi Everyone
>
> Which one should work faster (coalesce or repartition) if I need to reduce
> the number of partitions from 5000 to 3 before saving the RDD with
> saveAsTextFile?
>
> Total data size is about 400MB on disk in text format
>
> Thank you
>
