spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: How does shuffle work in spark ?
Date Thu, 16 Jan 2014 21:32:28 GMT
And if you are willing to take on the responsibility of guaranteeing yourself
that your mapping preserves partitioning, then you can look at mapPartitions,
mapPartitionsWithIndex, mapPartitionsWithContext or mapWith, each of which has
a preservesPartitioning parameter that you can set to true to avoid losing the
partitioner.
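
A rough sketch of what that looks like (assuming the Scala RDD API; the names
and data below are only illustrative):

  import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
  import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark)

  val sc = new SparkContext(
    new SparkConf().setAppName("preserve-partitioning").setMaster("local[4]"))

  // A (K, V) RDD with an explicit partitioner.
  val pairs = sc.parallelize(1 to 100)
    .map(i => (i % 10, i))
    .partitionBy(new HashPartitioner(4))

  // The function below never touches the keys, so it's safe to promise Spark
  // that partitioning is preserved.
  val doubled = pairs.mapPartitions(
    iter => iter.map { case (k, v) => (k, v * 2) },
    preservesPartitioning = true)

  println(doubled.partitioner)  // Some(HashPartitioner) -- not lost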


On Thu, Jan 16, 2014 at 1:08 PM, Ewen Cheslack-Postava <me@ewencp.org> wrote:

> The difference between a shuffle dependency and a transformation that can
> cause a shuffle is probably worth pointing out.
>
> The mentioned transformations (groupByKey, join, etc) *might* generate a
> shuffle dependency on input RDDs, but they won't necessarily. For example,
> if you join() two RDDs that already use the same partitioner (e.g. a
> default HashPartitioner with the default parallelism), then no shuffle
> needs to be performed (and nothing should hit disk). Any records that need
> to be considered together will already be in the same partitions of the
> input RDDs (e.g. all records with key X are guaranteed to be in partition
> hash(X) of both input RDDs, so no shuffling is needed).
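>
> A rough sketch of the co-partitioned case (assuming a spark-shell session, so
> `sc` and the pair-RDD implicits are already in scope; the data is made up):
>
>   import org.apache.spark.HashPartitioner
>
>   val part = new HashPartitioner(8)
>
>   // Both sides get the same partitioner up front (each partitionBy is itself
>   // a shuffle, so cache the results if they'll be joined more than once).
>   val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part).cache()
>   val scores = sc.parallelize(Seq((1, 10.0), (2, 7.5))).partitionBy(part).cache()
>
>   // Co-partitioned parents -> narrow dependencies: the join itself doesn't shuffle.
>   val joined = users.join(scores)   // RDD[(Int, (String, Double))]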
>
> Sometimes this is *really* worth exploiting, even if it only applies
> to one of the input RDDs. For example, if you're joining 2 RDDs and one is
> much larger than the other and already partitioned, you can explicitly use
> the partitioner from the larger RDD so that only the smaller RDD gets
> shuffled.
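>
> For instance, something like this (again only a sketch, with made-up names and
> sizes):
>
>   import org.apache.spark.HashPartitioner
>
>   // `big` stands in for a large RDD that's already partitioned (and cached).
>   val big = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
>     .partitionBy(new HashPartitioner(64)).cache()
>   val small = sc.parallelize(1 to 1000).map(i => (i, i * 2.0))
>
>   // Handing big's partitioner to join() means only `small` gets shuffled;
>   // big's existing partitions are reused in place.
>   val joined = big.join(small, big.partitioner.get)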
>
> This also means you probably want to pay attention to transformations that
> remove partitioners. For example, prefer mapValues() to map(). mapValues()
> has to keep the same keys, so the output is guaranteed to still be
> partitioned. map() can change the keys, so the partitioner is dropped even if
> your function actually leaves the keys unchanged.
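>
> A quick way to see the difference from a spark-shell (the map function below
> deliberately keeps the keys unchanged):
>
>   import org.apache.spark.HashPartitioner
>
>   val byKey = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
>
>   // mapValues can't touch the keys, so the partitioner survives.
>   println(byKey.mapValues(_.toUpperCase).partitioner)                   // Some(HashPartitioner)
>
>   // map could change the keys, so Spark drops the partitioner, even though
>   // this particular function leaves the keys alone.
>   println(byKey.map { case (k, v) => (k, v.toUpperCase) }.partitioner)  // None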
>
> -Ewen
>
> On January 16, 2014, at 12:16 PM, Patrick Wendell <pwendell@gmail.com> wrote:
> The intermediate shuffle output gets written to disk, but it often
> hits the OS buffer cache since it's not explicitly fsync'ed, so in
> many cases it stays entirely in memory. The behavior of the shuffle is
> agnostic to whether the base RDD is in cache or on disk.
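>
> For example (just a sketch; the input path is made up), the shuffle data ends
> up in the same local directories (spark.local.dir) whether or not the RDD is
> cached:
>
>   val events = sc.textFile("hdfs:///tmp/events")   // illustrative input path
>     .map(line => (line.take(1), line))
>
>   events.cache()                 // optional; doesn't change where shuffle data goes
>   events.groupByKey().count()    // map-side output -> spark.local.dir (and OS cache)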
>
> For on-disk RDDs or inputs, the shuffle path still has some key
> differences from Hadoop's implementation, including that it doesn't
> sort on the map side before shuffling.
>
> - Patrick
> On January 16, 2014, at 6:24 AM, suman bharadwaj <suman.dna@gmail.com> wrote:
> Hi,
>
> Is this behavior the same when the data is in memory?
> If the data is stored to disk, then how is it different from Hadoop
> MapReduce?
>
> Regards,
> SB
>
>
>
> On January 16, 2014, at 3:41 AM, Archit Thakur <archit279thakur@gmail.com> wrote:
> For any shuffle operation (groupByKey, etc.), it does write the map output to
> disk before performing the reduce task on the data.
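>
> One way to see that dependency is to look at the lineage (a small sketch,
> assuming a spark-shell session):
>
>   val pairs  = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))
>   val groups = pairs.groupByKey()
>
>   // toDebugString shows a ShuffledRDD in the lineage: the map output is
>   // materialized before the post-shuffle stage reads it.
>   println(groups.toDebugString)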
>
>
>
> On January 16, 2014, at 2:33 AM, suman bharadwaj <suman.dna@gmail.com> wrote:
> Hi,
>
> I'm new to Spark and wanted to understand more about how shuffle works in
> Spark.
>
> In Hadoop MapReduce, while performing a reduce operation, the
> intermediate data from the map phase gets written to disk. How does the same
> happen in Spark?
>
> Does Spark write the intermediate data to disk?
>
> Thanks in advance.
>
> Regards,
> SB
>
>
