spark-dev mailing list archives

From Sandy Ryza <sandy.r...@cloudera.com>
Subject Re: Increase partition count (repartition) without shuffle
Date Thu, 18 Jun 2015 21:36:10 GMT
Hi Alexander,

There is currently no way to create an RDD with more partitions than its
parent RDD without causing a shuffle.

However, if the files are splittable, you can set the Hadoop configuration
properties that control the maximum split size to a smaller value, so that
the resulting HadoopRDD ends up with more partitions.
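A minimal sketch of that approach (the input path is hypothetical, and the
exact property name depends on your Hadoop version -- on older Hadoop 1-style
configurations the equivalent key is `mapred.max.split.size`):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("smaller-splits"))

// Cap each input split at 64 MB (value is in bytes). The Hadoop default is
// typically the HDFS block size (e.g. 128 MB), so halving it roughly
// doubles the partition count for splittable files -- no shuffle involved.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

val rdd = sc.textFile("hdfs:///path/to/input")  // hypothetical path
println(rdd.partitions.length)  // more partitions than with default splits
```

Note this only works for splittable formats; gzip-compressed files, for
example, always produce one partition per file regardless of split size.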

-Sandy

On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander <alexander.ulanov@hp.com>
wrote:

>  Hi,
>
> Is there a way to increase the number of partitions of an RDD without
> causing a shuffle? I've found JIRA issue
> https://issues.apache.org/jira/browse/SPARK-5997, but there is no
> implementation yet.
>
> Just in case: I am reading data from ~300 big binary files, which results
> in 300 partitions. I then need to sort my RDD, but it crashes with an
> OutOfMemoryError. If I change the number of partitions to 2000, the sort
> works fine, but the repartition itself takes a lot of time due to the
> shuffle.
>
> Best regards, Alexander
>
