spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Or <>
Subject Re: Difference between Sort based and Hash based shuffle
Date Tue, 18 Aug 2015 21:13:52 GMT
Hi Muhammad,

On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive. An optimization with
hash-based shuffle is consolidation, where all mappers run in the same core
C write one file per reducer, resulting in C * R files. This is a strict
improvement, but it is still relatively expensive.

Instead, in sort-based shuffle each mapper writes a single partitioned
file. This allows a particular reducer to request a specific portion of
each mapper's single output file. In more detail, the mapper first fills up
an internal buffer in memory and continually spills the contents of the
buffer to disk, then finally merges all the spilled files together to form
one final output file. This places much less stress on the file system and
requires much fewer I/O operations especially on the read side.


2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <>

> I did check it out and although I did get a general understanding of the
> various classes used to implement Sort and Hash shuffles, however these
> slides lack details as to how they are implemented and why sort generally
> has better performance than hash
> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <>
> wrote:
>> Have a look at this presentation.
>> . Can be
>> of help to you.
>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>> wrote:
>>> What are the major differences between how Sort based and Hash based
>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>> than Hash?
>>> Any talks that discuss both shuffles in detail, how they are implemented
>>> and the performance gains ?

View raw message