spark-user mailing list archives

From Darin McBeath <>
Subject Confused about shuffle read and shuffle write
Date Wed, 21 Jan 2015 13:38:06 GMT
 I have the following code in a Spark Job.

// Get the baseline input file(s)
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
    sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class);
JavaPairRDD<String, String> hsfBaselinePairRDD = hsfBaselinePairRDDReadable
    .mapToPair(new ConvertFromWritableTypes())
    .partitionBy(new HashPartitioner(Variables.NUM_OUTPUT_PARTITIONS))
    .persist(StorageLevel.MEMORY_ONLY_SER());

// Use 'substring' to extract epoch values.
JavaPairRDD<String, Long> baselinePairRDD = hsfBaselinePairRDD
    .mapValues(new ExtractEpoch(accumBadBaselineRecords))
    .persist(StorageLevel.MEMORY_ONLY_SER());
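(For context, a minimal sketch of what the substring-based epoch extraction might look like. ExtractEpoch is a user-defined class whose record layout isn't shown in this post, so the `epoch=` field format below is purely a hypothetical assumption for illustration.)

```java
// Hypothetical sketch only: the real ExtractEpoch is a user-defined Spark
// Function not shown in the post. This assumes the epoch seconds appear in
// the value string as a field like "<epoch=1421847486>payload".
public class ExtractEpochSketch {

    // Pull the epoch out of a value such as "<epoch=1421847486>payload"
    // using substring/parse, as the post's comment suggests.
    static Long extractEpoch(String value) {
        int start = value.indexOf("epoch=") + "epoch=".length();
        int end = value.indexOf('>', start);
        try {
            return Long.parseLong(value.substring(start, end));
        } catch (NumberFormatException e) {
            // The real job tracks bad records via an accumulator
            // (accumBadBaselineRecords); this sketch just returns null.
            return null;
        }
    }
}
```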

When looking at the STAGE information for my job, I notice the following:

To construct hsfBaselinePairRDD, it takes about 7.5 minutes, with 931GB of input (from S3)
and 377GB of shuffle write (presumably because of the hash partitioning).  This all makes
sense.
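(The shuffle write here is expected: partitionBy(new HashPartitioner(n)) has to move every record to the partition its key hashes to. A minimal sketch of the assignment rule, mirroring Spark's HashPartitioner, which takes a non-negative key.hashCode() mod n:)

```java
// Sketch of why partitionBy with a HashPartitioner forces a shuffle write:
// each record is routed to the partition its key hashes to, so records
// must move between executors. Mirrors Spark's HashPartitioner logic.
public class HashPartitionSketch {

    // Non-negative hashCode mod numPartitions, as Spark's HashPartitioner does.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod; // keep result in [0, n)
    }
}
```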

Constructing the baselinePairRDD also takes about 7.5 minutes, which I thought was a bit
odd.  But what I thought was really odd is that there was also 330GB of shuffle read in this
stage.  I would have expected 0 shuffle read in this stage.

What I'm confused about is why there is any 'shuffle read' at all when constructing the
baselinePairRDD.  If anyone could shed some light on this, it would be appreciated.


