spark-user mailing list archives

From Taylor Cox <Taylor....@microsoft.com.INVALID>
Subject RE: Shuffle write explosion
Date Mon, 05 Nov 2018 21:41:01 GMT
At first glance, I wonder if your tables are partitioned; there may not be enough parallelism
in the job. You can also pass in an explicit number of partitions and/or a custom partitioner
to help Spark “guess” how to organize the shuffle.
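
For example, starting from the inputRDD in your snippet, something along these lines (just a rough
sketch; the partition count here is a placeholder you would tune to your cluster and data size):

import org.apache.hadoop.io.Text;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

// Placeholder partition count: tune to core count and input size.
int numPartitions = 256;

// Shuffling through an explicit partitioner makes the shuffle layout predictable.
JavaPairRDD<Text, Text> partitioned =
    inputRDD.partitionBy(new HashPartitioner(numPartitions));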

Have you seen any of these docs?
https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
https://spark.apache.org/docs/latest/tuning.html

Taylor


From: Yichen Zhou <zhouycsf@gmail.com>
Sent: Sunday, November 4, 2018 11:42 PM
To: user@spark.apache.org
Subject: Shuffle write explosion

Hi All,

When running a Spark job, I have 100MB+ of input but get more than 700GB of shuffle write, which
is really weird. The job finally ends with an OOM error. Does anybody know why this
happens?
[Screen Shot 2018-11-05 at 15.20.35.png]
My code is like:
JavaPairRDD<Text, Text> inputRDD = sc.sequenceFile(inputPath, Text.class, Text.class);
inputRDD.repartition(partitionNum)
        .mapToPair(...)
        .saveAsNewAPIHadoopDataset(job.getConfiguration());

Environment:
CPU 32 core; Memory 256G; Storage 7.5G
CentOS 7.5
java version "1.8.0_162"
Spark 2.1.2

Any help is greatly appreciated.

Regards,
Yichen