spark-user mailing list archives

From Punit Naik <naik.puni...@gmail.com>
Subject Re: repartitionAndSortWithinPartitions HELP
Date Fri, 15 Jul 2016 14:37:52 GMT
Okay that clears my doubt! Thanks a lot.

On 15-Jul-2016 7:43 PM, "Koert Kuipers" <koert@tresata.com> wrote:

spark's shuffle mechanism takes care of this kind of optimization
internally when you use the sort-based shuffle (which is the default).

On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik <naik.punit44@gmail.com> wrote:

> I meant to say that first we can sort the individual partitions and then
> sort them again by merging. Sort of a divide and conquer mechanism.
> Does sortByKey take care of all this internally?
>
>
> On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik <naik.punit44@gmail.com>
> wrote:
>
>> Can we increase the sorting speed of RDD by doing a secondary sort first?
>>
>> On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.punit44@gmail.com>
>> wrote:
>>
>>> Okay. Can't I supply the same partitioner I used for
>>> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>>>
>>> On 14-Jul-2016 11:38 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>>>
>>>> repartitionAndSortWithinPartitions partitions the rdd and sorts within
>>>> each partition. so each partition is fully sorted, but the rdd is not
>>>> sorted.
>>>>
>>>> sortByKey is basically the same as repartitionAndSortWithinPartitions
>>>> except it uses a range partitioner so that the entire rdd is sorted.
>>>> however since sortByKey uses a different partitioner than
>>>> repartitionAndSortWithinPartitions you do not get much benefit from running
>>>> sortByKey after repartitionAndSortWithinPartitions (because all the data
>>>> will get shuffled again)
>>>>
>>>>
>>>> On Thu, Jul 14, 2016 at 1:59 PM, Punit Naik <naik.punit44@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Koert
>>>>>
>>>>> I have already used "repartitionAndSortWithinPartitions" for secondary
>>>>> sorting and it works fine. Just wanted to know whether it will sort the
>>>>> entire RDD or not.
>>>>>
>>>>> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <koert@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> repartitionAndSortWithinPartit sort by keys, not values per key, so
>>>>>> not really secondary sort by itself.
>>>>>>
>>>>>> for secondary sort also check out:
>>>>>> https://github.com/tresata/spark-sorted
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik <naik.punit44@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys
>>>>>>>
>>>>>>> In my spark/scala code I am implementing secondary sort. I wanted to
>>>>>>> know, when I call the "repartitionAndSortWithinPartitions" method, the
>>>>>>> whole (entire) RDD will be sorted or only the individual partitions will be
>>>>>>> sorted?
>>>>>>> If it's the latter case, will applying a "sortByKey" after
>>>>>>> "repartitionAndSortWithinPartitions" be faster now that the individual
>>>>>>> partitions are sorted?
>>>>>>>
>>>>>>> --
>>>>>>> Thank You
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Punit Naik
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thank You
>>>>>
>>>>> Regards
>>>>>
>>>>> Punit Naik
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Thank You
>>
>> Regards
>>
>> Punit Naik
>>
>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>
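
For readers of the archive, the difference discussed in this thread can be
sketched in a few lines of Scala. This is only an illustration, not code from
either poster: the object name SortComparison, the helpers keysPerPartition
and isSorted, the sample data, and the partition counts are all made up for
the example. It assumes a Spark version that provides
repartitionAndSortWithinPartitions and runs against a local master.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; none of these names come from the thread.
object SortComparison {

  // Collect the keys held by each partition, in partition-index order.
  def keysPerPartition(rdd: RDD[(Int, String)]): Array[Array[Int]] =
    rdd.glom().collect().map(_.map(_._1))

  // True when the sequence is in non-decreasing order.
  def isSorted(xs: Seq[Int]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => a <= b }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sort-comparison").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(
        Seq(5, 3, 8, 1, 9, 2, 7, 4, 6, 0).map(k => (k, s"v$k")), 4)

      // Hash-partition by key, then sort each partition by key during the shuffle.
      val within = keysPerPartition(
        pairs.repartitionAndSortWithinPartitions(new HashPartitioner(3)))

      // sortByKey range-partitions the keys, so partition i holds keys that
      // are all <= the keys in partition i + 1.
      val global = keysPerPartition(
        pairs.sortByKey(ascending = true, numPartitions = 3))

      // Every partition is sorted on its own in both cases.
      println(within.forall(p => isSorted(p.toSeq)))
      println(global.forall(p => isSorted(p.toSeq)))

      // Reading the partitions one after another only gives a total order
      // for sortByKey. With Int keys 0 to 9 and HashPartitioner(3), the first
      // partition holds 0, 3, 6, 9, so the concatenation is not sorted.
      println(isSorted(within.flatten.toSeq))
      println(isSorted(global.flatten.toSeq))
    } finally {
      sc.stop()
    }
  }
}
```

Because the two calls use different partitioners (hash versus range), running
sortByKey on the output of repartitionAndSortWithinPartitions shuffles the
data again, which is why the earlier per-partition sort buys little.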
