spark-user mailing list archives

From Punit Naik <naik.puni...@gmail.com>
Subject Re: repartitionAndSortWithinPartitions HELP
Date Thu, 14 Jul 2016 18:57:08 GMT
I meant to say that first we can sort the individual partitions and then
sort them again by merging. Sort of a divide and conquer mechanism.
Does sortByKey take care of all this internally?

On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik <naik.punit44@gmail.com> wrote:

> Can we increase the sorting speed of RDD by doing a secondary sort first?
>
> On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.punit44@gmail.com>
> wrote:
>
>> Okay. Can't I supply the same partitioner I used for
>> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>>
>> On 14-Jul-2016 11:38 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>>
>>> repartitionAndSortWithinPartitions partitions the rdd and sorts within
>>> each partition. so each partition is fully sorted, but the rdd is not
>>> sorted.
>>>
>>> sortByKey is basically the same as repartitionAndSortWithinPartitions
>>> except it uses a range partitioner so that the entire rdd is sorted.
>>> however since sortByKey uses a different partitioner than
>>> repartitionAndSortWithinPartitions you do not get much benefit from running
>>> sortByKey after repartitionAndSortWithinPartitions (because all the data
>>> will get shuffled again)
>>>
>>>
>>> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.punit44@gmail.com>
>>> wrote:
>>>
>>>> Hi Koert
>>>>
>>>> I have already used "repartitionAndSortWithinPartitions" for secondary
>>>> sorting and it works fine. Just wanted to know whether it will sort the
>>>> entire RDD or not.
>>>>
>>>> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <koert@tresata.com>
>>>> wrote:
>>>>
>>>>> repartitionAndSortWithinPartitions sorts by keys, not by values per
>>>>> key, so it is not really a secondary sort by itself.
>>>>>
>>>>> for secondary sort also check out:
>>>>> https://github.com/tresata/spark-sorted
>>>>>
>>>>>
>>>>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.punit44@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi guys
>>>>>>
>>>>>> In my Spark/Scala code I am implementing secondary sort. I wanted to
>>>>>> know: when I call the "repartitionAndSortWithinPartitions" method, is
>>>>>> the whole (entire) RDD sorted, or only the individual partitions?
>>>>>> If it is the latter, will applying "sortByKey" after
>>>>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>>>>> partitions are sorted?
>>>>>>
>>>>>> --
>>>>>> Thank You
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Punit Naik
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thank You
>>>>
>>>> Regards
>>>>
>>>> Punit Naik
>>>>
>>>
>>>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>



-- 
Thank You

Regards

Punit Naik
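
The distinction Koert draws in the quoted reply (every partition sorted versus the whole RDD sorted) can be illustrated with a small sketch. This is a plain-Scala simulation on ordinary collections, not Spark code: the object name, the method names, and the hand-picked range bounds are all hypothetical and exist only for illustration. In Spark itself, the range partitioner used by sortByKey derives its bounds by sampling the keys.

```scala
// Plain-Scala simulation (no Spark dependency) of the two strategies in this thread.
// All names here are hypothetical and chosen only for illustration.
object PartitionSortSketch {

  // Hash-style partitioning followed by a sort inside each partition, mimicking
  // repartitionAndSortWithinPartitions when it is given a hash partitioner.
  def hashPartitionAndSort(keys: Seq[Int], numPartitions: Int): Vector[Vector[Int]] = {
    val grouped = keys.groupBy(k => Math.floorMod(k, numPartitions))
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).sorted.toVector)
  }

  // Range-style partitioning followed by a sort inside each partition, mimicking
  // the idea behind sortByKey. `bounds` holds assumed, hand-picked upper bounds;
  // partition i receives the keys that are <= bounds(i) and > bounds(i - 1).
  def rangePartitionAndSort(keys: Seq[Int], bounds: Seq[Int]): Vector[Vector[Int]] = {
    def partitionOf(k: Int): Int = {
      val i = bounds.indexWhere(k <= _)
      if (i >= 0) i else bounds.length // keys above every bound go to the last partition
    }
    val grouped = keys.groupBy(partitionOf)
    Vector.tabulate(bounds.length + 1)(i => grouped.getOrElse(i, Seq.empty).sorted.toVector)
  }

  // True when reading the partitions in order yields a globally sorted sequence.
  def globallySorted(parts: Vector[Vector[Int]]): Boolean = {
    val flat = parts.flatten
    flat == flat.sorted
  }
}
```

With the keys 7, 2, 9, 4, 1, 8 and two partitions, the hash version gives two sorted partitions whose concatenation is not sorted, while the range version (with an assumed bound of 4) gives partitions that read in order as one sorted sequence. Because the two layouts place keys in different partitions, converting one into the other means moving the data again, which matches the point about the second shuffle.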
