spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Or <>
Subject Re: Difference between Sort based and Hash based shuffle
Date Wed, 19 Aug 2015 20:20:49 GMT
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no
consolidation), but a segment of partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <>:

> Thanks Andrew for a detailed response,
> So the reason why key value pairs with same keys are always found in a
> single buckets in Hash based shuffle but not in Sort is because in
> sort-shuffle each mapper writes a single partitioned file, and it is up to
> the reducer to fetch correct partitions from the the files ?
> On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <> wrote:
>> Hi Muhammad,
>> On a high level, in hash-based shuffle each mapper M writes R shuffle
>> files, one for each reducer where R is the number of reduce partitions.
>> This results in M * R shuffle files. Since it is not uncommon for M and R
>> to be O(1000), this quickly becomes expensive. An optimization with
>> hash-based shuffle is consolidation, where all mappers run in the same core
>> C write one file per reducer, resulting in C * R files. This is a strict
>> improvement, but it is still relatively expensive.
>> Instead, in sort-based shuffle each mapper writes a single partitioned
>> file. This allows a particular reducer to request a specific portion of
>> each mapper's single output file. In more detail, the mapper first fills up
>> an internal buffer in memory and continually spills the contents of the
>> buffer to disk, then finally merges all the spilled files together to form
>> one final output file. This places much less stress on the file system and
>> requires much fewer I/O operations especially on the read side.
>> -Andrew
>> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
>>> I did check it out and although I did get a general understanding of the
>>> various classes used to implement Sort and Hash shuffles, however these
>>> slides lack details as to how they are implemented and why sort generally
>>> has better performance than hash
>>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <>
>>> wrote:
>>>> Have a look at this presentation.
>>>> . Can be
>>>> of help to you.
>>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>>>> wrote:
>>>>> What are the major differences between how Sort based and Hash based
>>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>>> than Hash?
>>>>> Any talks that discuss both shuffles in detail, how they are
>>>>> implemented and the performance gains ?

View raw message