spark-user mailing list archives

From Punit Naik <naik.puni...@gmail.com>
Subject Re: repartitionAndSortWithinPartitions HELP
Date Thu, 14 Jul 2016 18:38:46 GMT
Can we speed up sorting an RDD by doing a secondary sort first?
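
For reference, the difference Koert describes in the quoted thread below can be sketched in a few lines of Scala. This is only an illustration, not code from the thread: the object name, the helper `keysByPartition`, the sample data and the `local[2]` master are all made up, and it assumes spark-core is on the classpath. As far as I know, `sortByKey` takes only an ascending flag and a partition count, not a `Partitioner`, so the sketch passes a `RangePartitioner` to `repartitionAndSortWithinPartitions` instead.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch; names and data are illustrative only.
object SortWithinPartitionsSketch {

  // Keys of every partition, listed in partition order.
  def keysByPartition(rdd: RDD[(Int, String)]): Array[Array[Int]] =
    rdd.glom().collect().map(_.map(_._1))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs: RDD[(Int, String)] =
        sc.parallelize(Seq(5, 3, 8, 1, 9, 2, 7, 4, 6, 0).map(k => (k, s"v$k")), 3)

      // Hash partitioner: every partition is sorted, but the key ranges of
      // different partitions interleave, so the RDD as a whole is not sorted.
      val hashed = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(3))
      keysByPartition(hashed).foreach(p => println(p.mkString("hash:  ", ",", "")))

      // Range partitioner: partition i only holds keys below those of
      // partition i + 1, so sorted partitions read in order are globally sorted.
      val ranged = pairs.repartitionAndSortWithinPartitions(new RangePartitioner(3, pairs))
      keysByPartition(ranged).foreach(p => println(p.mkString("range: ", ",", "")))

      // Same global ordering as sortByKey, which shuffles with its own
      // range partitioner.
      val viaSortByKey = pairs.sortByKey().keys.collect()
      assert(ranged.keys.collect().sameElements(viaSortByKey))
    } finally {
      sc.stop()
    }
  }
}
```

If this reading is right, a single shuffle with a range partitioner already gives a totally ordered RDD, so running `sortByKey` after a hash-partitioned `repartitionAndSortWithinPartitions` would only add a second shuffle.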

On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.punit44@gmail.com> wrote:

> Okay. Can't I supply the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>
> On 14-Jul-2016 11:38 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>
>> repartitionAndSortWithinPartitions partitions the rdd and sorts within
>> each partition. so each partition is fully sorted, but the rdd is not
>> sorted.
>>
>> sortByKey is basically the same as repartitionAndSortWithinPartitions
>> except it uses a range partitioner so that the entire rdd is sorted.
>> however since sortByKey uses a different partitioner than
>> repartitionAndSortWithinPartitions you do not get much benefit from running
>> sortByKey after repartitionAndSortWithinPartitions (because all the data
>> will get shuffled again)
>>
>>
>> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.punit44@gmail.com>
>> wrote:
>>
>>> Hi Koert
>>>
>>> I have already used "repartitionAndSortWithinPartitions" for secondary
>>> sorting and it works fine. Just wanted to know whether it will sort the
>>> entire RDD or not.
>>>
>>> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <koert@tresata.com>
>>> wrote:
>>>
>>>> repartitionAndSortWithinPartitions sorts by keys, not values per key, so not
>>>> really secondary sort by itself.
>>>>
>>>> for secondary sort also check out:
>>>> https://github.com/tresata/spark-sorted
>>>>
>>>>
>>>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.punit44@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi guys
>>>>>
>>>>> In my spark/scala code I am implementing secondary sort. I wanted to
>>>>> know, when I call the "repartitionAndSortWithinPartitions" method, the
>>>>> whole (entire) RDD will be sorted or only the individual partitions will be
>>>>> sorted?
>>>>> If it's the latter case, will applying a "sortByKey" after
>>>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>>>> partitions are sorted?
>>>>>
>>>>> --
>>>>> Thank You
>>>>>
>>>>> Regards
>>>>>
>>>>> Punit Naik
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thank You
>>>
>>> Regards
>>>
>>> Punit Naik
>>>
>>
>>


-- 
Thank You

Regards

Punit Naik
