spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: How does shuffle work in spark ?
Date Thu, 16 Jan 2014 20:16:24 GMT
The intermediate shuffle output gets written to disk, but it often
hits the OS buffer cache since it's not explicitly fsync'ed, so in
many cases it stays entirely in memory. The behavior of the shuffle is
the same whether the base RDD is cached in memory or stored on disk.
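
As a rough illustration of the point about fsync, here is a minimal Python sketch of the OS-level behavior only. It is not Spark's shuffle code, and the function names and the file name are hypothetical. A write that is never fsync'ed is handed to the OS and can usually be read straight back from the buffer cache:

```python
import os
import tempfile

def write_shuffle_block(path, payload, durable=False):
    """Write one hypothetical map-output block to a file.

    With durable=False the bytes are handed to the OS and typically sit
    in the buffer cache; nothing forces them onto the physical device.
    """
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                 # push Python's user-space buffer to the OS
        if durable:
            os.fsync(f.fileno())  # ask the OS to write through to the device
    return len(payload)

def read_shuffle_block(path):
    """Read the block back, as a reduce-side fetch might."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical scratch directory and block name, for illustration only.
shuffle_dir = tempfile.mkdtemp(prefix="shuffle-sketch-")
block_path = os.path.join(shuffle_dir, "map_0_reduce_0.data")

written = write_shuffle_block(block_path, b"key1\tvalue1\n")
block = read_shuffle_block(block_path)
```

Whether the read is actually served from memory depends on the OS and on memory pressure, so this only shows the write path that skips fsync, not a guarantee.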

For on-disk RDDs or inputs, the shuffle path still has some key
differences from Hadoop's implementation, including that it doesn't
sort on the map side before shuffling.
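
To make the map-side difference concrete, here is a small Python sketch. It is a conceptual model, not Spark's or Hadoop's actual code; the function names are hypothetical, and crc32 is an assumption used only to keep the bucketing deterministic. Both variants bucket records by reducer, but only the second sorts each bucket by key before it would be written out:

```python
import zlib
from collections import defaultdict

def bucket_for(key, num_reducers):
    # crc32 is an assumption made to keep the sketch deterministic;
    # it is not the partitioner either system actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side_unsorted(records, num_reducers):
    """Bucket records per reducer, keeping arrival order (no map-side sort)."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_for(key, num_reducers)].append((key, value))
    return dict(buckets)

def map_side_sorted(records, num_reducers):
    """Same bucketing, but each bucket is sorted by key before output."""
    buckets = map_side_unsorted(records, num_reducers)
    return {b: sorted(recs, key=lambda kv: kv[0]) for b, recs in buckets.items()}

records = [("pear", 1), ("apple", 2), ("pear", 3), ("apple", 4)]

# A single reducer keeps the comparison easy to read.
unsorted_out = map_side_unsorted(records, 1)
sorted_out = map_side_sorted(records, 1)
```

With one reducer, `unsorted_out[0]` keeps the records in input order, while `sorted_out[0]` groups the "apple" records ahead of the "pear" records. The sort is the extra map-side work that the unsorted path skips.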

- Patrick

On Thu, Jan 16, 2014 at 6:24 AM, suman bharadwaj <suman.dna@gmail.com> wrote:
> Hi,
>
> Is this behavior the same when the data is in memory?
> If the data is stored to disk, then how is it different from Hadoop
> MapReduce?
>
> Regards,
> SB
>
>
> On Thu, Jan 16, 2014 at 5:11 PM, Archit Thakur <archit279thakur@gmail.com>
> wrote:
>>
>> For any shuffle operation (groupByKey, etc.), it does write map output to
>> disk before performing the reduce task on the data.
>>
>>
>> On Thu, Jan 16, 2014 at 4:03 PM, suman bharadwaj <suman.dna@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm new to Spark and wanted to understand more about how shuffle works
>>> in Spark.
>>>
>>> In Hadoop MapReduce, while performing a reduce operation, the
>>> intermediate data from the map gets written to disk. How does the same
>>> happen in Spark?
>>>
>>> Does Spark write the intermediate data to disk?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> SB
>>
>>
>
