spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Or <and...@databricks.com>
Subject Re: Difference between Sort based and Hash based shuffle
Date Wed, 19 Aug 2015 20:20:49 GMT
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no
consolidation), but a segment of partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11besemjaved@seecs.edu.pk>:

> Thanks Andrew for a detailed response,
>
> So the reason why key value pairs with same keys are always found in a
> single buckets in Hash based shuffle but not in Sort is because in
> sort-shuffle each mapper writes a single partitioned file, and it is up to
> the reducer to fetch correct partitions from the the files ?
>
> On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <andrew@databricks.com> wrote:
>
>> Hi Muhammad,
>>
>> On a high level, in hash-based shuffle each mapper M writes R shuffle
>> files, one for each reducer where R is the number of reduce partitions.
>> This results in M * R shuffle files. Since it is not uncommon for M and R
>> to be O(1000), this quickly becomes expensive. An optimization with
>> hash-based shuffle is consolidation, where all mappers run in the same core
>> C write one file per reducer, resulting in C * R files. This is a strict
>> improvement, but it is still relatively expensive.
>>
>> Instead, in sort-based shuffle each mapper writes a single partitioned
>> file. This allows a particular reducer to request a specific portion of
>> each mapper's single output file. In more detail, the mapper first fills up
>> an internal buffer in memory and continually spills the contents of the
>> buffer to disk, then finally merges all the spilled files together to form
>> one final output file. This places much less stress on the file system and
>> requires much fewer I/O operations especially on the read side.
>>
>> -Andrew
>>
>>
>>
>> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
>> 11besemjaved@seecs.edu.pk>:
>>
>>> I did check it out and although I did get a general understanding of the
>>> various classes used to implement Sort and Hash shuffles, however these
>>> slides lack details as to how they are implemented and why sort generally
>>> has better performance than hash
>>>
>>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ravikiranmagham@gmail.com>
>>> wrote:
>>>
>>>> Have a look at this presentation.
>>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>>> of help to you.
>>>>
>>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>>> 11besemjaved@seecs.edu.pk> wrote:
>>>>
>>>>> What are the major differences between how Sort based and Hash based
>>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>>> than Hash?
>>>>> Any talks that discuss both shuffles in detail, how they are
>>>>> implemented and the performance gains ?
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message