spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Shuffle produces one huge partition and many tiny partitions
Date Thu, 18 Jun 2015 23:06:46 GMT
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, "Du Li" <lidu@yahoo-inc.com.invalid> wrote:

> I got the same problem with rdd.repartition() in my streaming app, which
> generated a few huge partitions and many tiny partitions. The resulting
> high data skew makes the processing time of a batch unpredictable, and it
> often exceeds the batch interval. I eventually solved the problem by using
> rdd.coalesce() instead, which is expensive, however, as it generates a lot
> of shuffle traffic and also takes a long time.
>
> Du
>
>
>
>   On Thursday, June 18, 2015 1:00 AM, Al M <alasdair.mcbride@gmail.com>
> wrote:
>
>
> Thanks for the suggestion.  Unfortunately, repartition didn't help us.  It
> still puts everything into the same partition.
>
> We did manage to improve the situation by writing a new partitioner that
> extends HashPartitioner.  It treats certain "exception" keys differently:
> keys that are known to appear very often are assigned random partitions
> instead of going through the existing hash partitioning mechanism.
>
>
>
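The partitioner idea quoted above can be sketched without Spark. This is only an illustration under stated assumptions, not Al M's actual code: the class name `SkewAwarePartitioner`, the `hot_keys` parameter, the CRC32 hash, and the sample data are all hypothetical. It shows how sending known hot keys to random partitions evens out partition sizes compared with plain hash partitioning.

```python
import random
import zlib
from collections import Counter


class SkewAwarePartitioner:
    """Hash partitioner that scatters known hot keys across all partitions.

    Hypothetical stand-in for the custom partitioner described in the
    thread; this is not a Spark class.
    """

    def __init__(self, num_partitions, hot_keys, seed=None):
        self.num_partitions = num_partitions
        self.hot_keys = frozenset(hot_keys)
        self._rng = random.Random(seed)

    def get_partition(self, key):
        if key in self.hot_keys:
            # Hot ("exception") key: pick a partition uniformly at random.
            return self._rng.randrange(self.num_partitions)
        # Ordinary key: stable hash modulo the partition count.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_partitions


def partition_sizes(keys, partitioner):
    """Count how many records land in each partition."""
    counts = Counter(partitioner.get_partition(k) for k in keys)
    return [counts.get(p, 0) for p in range(partitioner.num_partitions)]


# Assumed sample data: 9,000 records share one hot key, 1,000 have distinct keys.
keys = ["hot"] * 9000 + ["user-%d" % i for i in range(1000)]

plain = SkewAwarePartitioner(8, hot_keys=[], seed=0)        # plain hashing
salted = SkewAwarePartitioner(8, hot_keys=["hot"], seed=0)  # hot key scattered

plain_sizes = partition_sizes(keys, plain)
salted_sizes = partition_sizes(keys, salted)

print("plain :", plain_sizes)
print("salted:", salted_sizes)
```

One trade-off to keep in mind: once a hot key is scattered, records with that key no longer share a partition, so any per-key aggregation on those keys needs a second combining step.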
