spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Du Li <>
Subject Re: Shuffle produces one huge partition and many tiny partitions
Date Thu, 18 Jun 2015 23:55:26 GMT
repartition() means coalesce(shuffle=false) 

     On Thursday, June 18, 2015 4:07 PM, Corey Nolet <> wrote:

 Doesn't repartition call coalesce(shuffle=true)?On Jun 18, 2015 6:53 PM, "Du Li" <>

I got the same problem with rdd,repartition() in my streaming app, which generated a few huge
partitions and many tiny partitions. The resulting high data skew makes the processing time
of a batch unpredictable and often exceeding the batch interval. I eventually solved the problem
by using rdd.coalesce() instead, which however is expensive as it yields a lot of shuffle
traffic and also takes a long time.

     On Thursday, June 18, 2015 1:00 AM, Al M <> wrote:

 Thanks for the suggestion.  Repartition didn't help us unfortunately.  It
still puts everything into the same partition.

We did manage to improve the situation by making a new partitioner that
extends HashPartitioner.  It treats certain "exception" keys differently. 
These keys that are known to appear very often are assigned random
partitions instead of using the existing partitioning mechanism.

View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:


View raw message