spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Leanken.Lin (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize
Date Mon, 13 Jul 2020 07:44:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Leanken.Lin updated SPARK-32290:
--------------------------------
    Fix Version/s: 3.0.1

> NotInSubquery SingleColumn Optimize
> -----------------------------------
>
>                 Key: SPARK-32290
>                 URL: https://issues.apache.org/jira/browse/SPARK-32290
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>             Fix For: 3.0.1, 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally,
> A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very very time consuming.
For example, I've done TPCH benchmark lately, Query 16 almost took half of the entire TPCH
22Query execution Time. So i proposed that to do the following optimize.
> Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only single
column in following pattern.
> {code:java}
> case _@Or(
>             _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference),
>             _@IsNull(
>               _@EqualTo(_: AttributeReference, _: AttributeReference)
>             )
>           )
> {code}
> if buildSide rows is small enough, we can change build side data into a HashMap.
> so the M*N calculation can be optimized into M*log(N)
> I've done a benchmark job in 1TB TPCH, before apply the optimize
> Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it takes only
30s to finish. 
> But this optimize only works on single column not in subquery, so i am here to seek advise
whether the community need this update or not. I will do the pull request first, if the community
member thought it's hack, it's fine to just ignore this request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message