spark-user mailing list archives

From lsn24 <lekshmi.s...@gmail.com>
Subject SortMerge Join on partitioned column causes shuffle
Date Wed, 27 Mar 2019 00:29:22 GMT
Hello,

We have two datasets that were persisted as follows:

Dataset A:
datasetA.repartition(5,datasetA.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable("database.tableA");

Dataset B:
datasetB.repartition(5,datasetB.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5,"studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable("database.tableB");


When I join on both region and studentId, I see a shuffle. If I join only on
the bucketed column studentId, there is NO shuffle, as expected.
Below is the join query.

spark.sql("SELECT * FROM database.tableA")
  .join(spark.sql("SELECT * FROM database.tableB"), Seq("studentId", "region"))
  .show(10)
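
As a rough mental model of why the studentId-only join can avoid a shuffle, here is a minimal sketch in plain Java (no Spark involved) that assigns rows to buckets. The class name BucketSketch, the method bucketOf, and the sample IDs and regions are hypothetical. Long.hashCode is only a stand-in for Spark's own bucket hash, so the bucket numbers will not match what Spark actually writes.

```java
class BucketSketch {
    // Stand-in bucket function (assumption: NOT Spark's real hash).
    // It only models the property "same studentId -> same bucket".
    static int bucketOf(long studentId, int numBuckets) {
        return Math.floorMod(Long.hashCode(studentId), numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 5; // mirrors bucketBy(5, "studentId")

        // Made-up rows: {studentId, region code}. Region is deliberately
        // ignored by bucketOf, just as bucketBy only names studentId.
        long[][] tableA = {{101, 1}, {202, 2}, {303, 1}};
        long[][] tableB = {{101, 2}, {202, 2}, {303, 3}};

        for (int i = 0; i < tableA.length; i++) {
            long id = tableA[i][0];
            int bucketA = bucketOf(tableA[i][0], numBuckets);
            int bucketB = bucketOf(tableB[i][0], numBuckets);
            System.out.println("studentId=" + id
                    + " bucket in A=" + bucketA
                    + " bucket in B=" + bucketB);
        }
    }
}
```

In this model the bucket depends only on studentId, so rows for the same student in tableA and tableB share a bucket index whatever their region. That is consistent with the shuffle-free join on studentId alone.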

What could be the reason for the shuffle when we include the partition key,
and how can we mitigate it?

Thanks