spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?
Date Mon, 11 Nov 2019 03:44:01 GMT
Yea codegen can be a good improvement, PRs are welcome!

On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang <gwang3@ebay.com> wrote:

> That’s right. By default, Spark prefers sort merge join.
>
> While, in our product environment, there are many huge bucket tables. We
> can leverage the bucketing to avoid shuffle when join with other small
> tables (the small tables are not small enough to leverage broad cast join).
> Problem is that, although shuffle can be avoid, sort is still necessary to
> leverage sort merge join (we cannot pre-sort since there are different join
> patterns). For a huge table, sort may take even tens of seconds.
>
> That’s why I’m trying to enable shuffle hash join, and for such cases,
> there were almost 10% ~ 20% improvement when apply shuffle hash join
> instead of sort merge join. I wonder if there is still some space to
> improve shuffle hash join? Like code generation for ShuffledHashJoinExec or
> something….
>
>
>
> *From: *Wenchen Fan <cloud0fan@gmail.com>
> *Date: *Sunday, November 10, 2019 at 5:57 PM
> *To: *"Wang, Gang" <gwang3@ebay.com.invalid>
> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Subject: *Re: Why not implement CodegenSupport in class
> ShuffledHashJoinExec?
>
>
>
> By default sort merge join is preferred over shuffle hash join, that's why
> we haven't spend resources to implement codegen for it.
>
>
>
> On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang <gwang3@ebay.com.invalid>
> wrote:
>
> There are some cases, shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>
>

Mime
View raw message