spark-dev mailing list archives

From Mridul Muralidharan <mri...@gmail.com>
Subject Re: Increase partition count (repartition) without shuffle
Date Thu, 18 Jun 2015 22:02:13 GMT
If you can scan the input twice, you can of course do a per-partition count and
build a custom RDD which can repartition without a shuffle.
But there is nothing off the shelf, as Sandy mentioned.

Regards
Mridul

On Thursday, June 18, 2015, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> Hi Alexander,
>
> There is currently no way to create an RDD with more partitions than its
> parent RDD without causing a shuffle.
>
> However, if the files are splittable, you can set the Hadoop
> configurations that control split size to something smaller so that the
> HadoopRDD ends up with more partitions.
>
> -Sandy
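Sandy's suggestion might look like the following PySpark sketch (the path is a placeholder, and the `spark.hadoop.*` prefix is assumed as the passthrough for Hadoop configuration; `mapreduce.input.fileinputformat.split.maxsize` caps the size of each input split, so splittable files yield more partitions):

```python
# Sketch: shrink Hadoop input splits so each file produces more partitions.
# Assumes splittable input; the path below is hypothetical.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("more-splits")
        # cap input splits at 32 MB instead of the default block size
        .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize",
             str(32 * 1024 * 1024)))
sc = SparkContext(conf=conf)
rdd = sc.textFile("hdfs:///data/input")  # hypothetical path
```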
>
> On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander <alexander.ulanov@hp.com>
> wrote:
>
>>  Hi,
>>
>> Is there a way to increase the number of partitions of an RDD without
>> causing a shuffle? I've found JIRA issue
>> https://issues.apache.org/jira/browse/SPARK-5997, but there is no
>> implementation yet.
>>
>> Just in case: I am reading data from ~300 big binary files, which results
>> in 300 partitions. Then I need to sort my RDD, but it crashes with an
>> OutOfMemory exception. If I change the number of partitions to 2000, the
>> sort works fine, but the repartition itself takes a lot of time due to the
>> shuffle.
>>
>> Best regards, Alexander
>>
>
>
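One more angle on the quoted OOM scenario, as a hedged sketch: since a sort shuffles anyway, asking the sort itself for more output partitions may avoid the separate repartition pass entirely. `rdd` below stands for the ~300-partition key/value RDD from the binary files:

```python
# Sketch: fold the partition increase into the sort's own shuffle.
# `rdd` is assumed to be an existing key/value RDD on an existing
# SparkContext; sortByKey accepts a target partition count.
sorted_rdd = rdd.sortByKey(numPartitions=2000)
```

This does one shuffle (the sort's) instead of two (repartition, then sort).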
