spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Rosen (JIRA)" <>
Subject [jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
Date Thu, 28 Aug 2014 18:16:09 GMT


Josh Rosen commented on SPARK-3280:

Here are some numbers from August 10.  If I recall, this was running on 8 m3.8xlarge nodes.
 This test linearly scales a bunch of parameters (data set size, numbers of mappers and reducers,
etc).  You can see that hash-based shuffle's performance degrades severely in cases where
we have many mappers and reducers, while sort scales much more gracefully:



This was run with spark-perf; here's a sample config for one of the bars:

Java options: -Dspark.serializer=org.apache.spark.serializer.JavaSerializer
-Dspark.locality.wait=60000000 -Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 --num-partitions=400
--reduce-tasks=400 --random-seed=5 --persistent-type=memory  --num-records=200000000 --unique-keys=20000
--key-length=10 --unique-values=1000000 --value-length=10  --storage-location=hdfs://:9000/spark-perf-kv-data

I'll try to run a better set of tests today.  I plan to look at a few cases that these tests
didn't address, including the performance impact when running on spinning disks, as well as
jobs where we have a large dataset with few mappers and reducers (I think this is the case
that we'd expect to be most favorable to hash-based shuffle).

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>                 Key: SPARK-3280
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
> sort-based shuffle has lower memory usage and seems to outperform hash-based in almost
all of our testing.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message