spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Corey Nolet <cjno...@gmail.com>
Subject Re: Shuffle produces one huge partition and many tiny partitions
Date Fri, 19 Jun 2015 00:43:20 GMT
Sorry Du,

Repartition means coalesce(shuffle = true) as per [1]. They are the same
operation. Coalescing with shuffle = false means you are specifying the max
amount of partitions after the coalesce (if there are less partitions you
will end up with the lesser amount.


[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341


On Thu, Jun 18, 2015 at 7:55 PM, Du Li <lidu@yahoo-inc.com.invalid> wrote:

> repartition() means coalesce(shuffle=false)
>
>
>
>   On Thursday, June 18, 2015 4:07 PM, Corey Nolet <cjnolet@gmail.com>
> wrote:
>
>
> Doesn't repartition call coalesce(shuffle=true)?
> On Jun 18, 2015 6:53 PM, "Du Li" <lidu@yahoo-inc.com.invalid> wrote:
>
> I got the same problem with rdd,repartition() in my streaming app, which
> generated a few huge partitions and many tiny partitions. The resulting
> high data skew makes the processing time of a batch unpredictable and often
> exceeding the batch interval. I eventually solved the problem by using
> rdd.coalesce() instead, which however is expensive as it yields a lot of
> shuffle traffic and also takes a long time.
>
> Du
>
>
>
>   On Thursday, June 18, 2015 1:00 AM, Al M <alasdair.mcbride@gmail.com>
> wrote:
>
>
> Thanks for the suggestion.  Repartition didn't help us unfortunately.  It
> still puts everything into the same partition.
>
> We did manage to improve the situation by making a new partitioner that
> extends HashPartitioner.  It treats certain "exception" keys differently.
> These keys that are known to appear very often are assigned random
> partitions instead of using the existing partitioning mechanism.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-produces-one-huge-partition-and-many-tiny-partitions-tp23358p23387.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
>
>
>
>

Mime
View raw message