spark-dev mailing list archives

From Brett Marcott <brett.marc...@gmail.com>
Subject SortMergeJoinExec: Utilizing child partitioning when joining
Date Wed, 01 Jan 2020 07:48:48 GMT
Hi all,

I found this jira for an issue I ran into recently:
https://issues.apache.org/jira/browse/SPARK-28771

My initial idea for a fix is to change the requiredChildDistribution of
SortMergeJoinExec (and ShuffledHashJoinExec).

If at least the conditions below are all met, we would only need to require
a subset of the join keys for partitioning:
- the left and right children's output partitionings are both
  HashPartitioning with the same numPartitions
- the left and right partitioning expressions cover the same subset (with
  regard to indices) of their respective join keys
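
To make the two conditions above concrete, here is a minimal, self-contained
sketch of the check. It is a simplified model, not Spark code: HashPart is a
hypothetical stand-in for HashPartitioning, expressions are plain column
names rather than Catalyst expressions, and reusableKeyIndices is an
illustrative helper name.

```scala
object PartitioningReuse {
  // Hypothetical stand-in for Spark's HashPartitioning; expressions are
  // plain column names here, not Catalyst Expression objects.
  final case class HashPart(expressions: Seq[String], numPartitions: Int)

  // Map each partitioning expression to its position in the join keys.
  // Returns None if any partitioning expression is not a join key.
  def keyIndices(part: HashPart, joinKeys: Seq[String]): Option[Seq[Int]] = {
    val idx = part.expressions.map(e => joinKeys.indexOf(e))
    if (idx.contains(-1)) None else Some(idx)
  }

  // Returns the shared join-key indices when both conditions hold:
  // same numPartitions, and both sides partitioned on the same
  // (ordered, non-empty) index subset of their respective join keys.
  def reusableKeyIndices(
      left: HashPart,
      leftKeys: Seq[String],
      right: HashPart,
      rightKeys: Seq[String]): Option[Seq[Int]] =
    if (left.numPartitions != right.numPartitions) None
    else
      for {
        l <- keyIndices(left, leftKeys)
        r <- keyIndices(right, rightKeys)
        if l.nonEmpty && l == r
      } yield l
}
```

For example, with join keys (a, b) on the left and (x, y) on the right, a
left child hashed on a and a right child hashed on x both map to index 0, so
the subset to require would be the first join key on each side. The index
comparison is order-sensitive on purpose, since hashing (a, b) and (b, a)
need not place rows in the same partition.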

If that subset of keys is returned by requiredChildDistribution,
then EnsureRequirements.ensureDistributionAndOrdering would not add a
shuffle stage, hence reusing the children's partitioning.
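
The effect on the shuffle decision can be illustrated with a toy model. This
sketch assumes that a join's required distribution is satisfied only when the
child's hash expressions match the required keys exactly and in order; that
rule, and the names HashPart, satisfies and needsShuffle, are assumptions
made for illustration rather than Spark's actual API. It also ignores the
check that both children end up with the same number of partitions.

```scala
object ShuffleDecision {
  // Hypothetical stand-in for a child's hash-based output partitioning.
  final case class HashPart(expressions: Seq[String], numPartitions: Int)

  // Assumed rule: the requirement is met only on an exact, ordered match
  // between the child's partitioning expressions and the required keys.
  def satisfies(child: HashPart, requiredKeys: Seq[String]): Boolean =
    child.expressions == requiredKeys

  // In this model a shuffle is added whenever the requirement is not met.
  def needsShuffle(child: HashPart, requiredKeys: Seq[String]): Boolean =
    !satisfies(child, requiredKeys)
}
```

Under this model, a child hashed on a needs a shuffle when the join requires
(a, b), but not when the join only requires the subset (a), which is the
reuse described above.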

1. Thoughts on this approach?

2. Could someone help explain why the different join types have different
output partitionings in SortMergeJoinExec.outputPartitioning?
<https://github.com/apache/spark/blob/cdcd43cbf2479b258f4c5cfa0f6306f475d25cf2/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L85-L96>

Thanks,
Brett
