spark-issues mailing list archives

From "linna shuang (Jira)" <>
Subject [jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join
Date Fri, 08 May 2020 07:43:00 GMT


linna shuang commented on SPARK-16951:

In our TPC-H tests we hit a performance problem with Q16, which uses a NOT IN subquery that is
translated into a broadcast nested loop join. This single query takes almost half of the total
time of all 22 queries. For example, on a 512 GB data set the total execution time is 1400
seconds, of which Q16 accounts for 630 seconds.

TPC-H is a common Spark SQL performance benchmark, so this problem will be encountered often.
Is there a plan to reopen and fix this issue?

> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>                 Key: SPARK-16951
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>            Priority: Major
> A transformation currently used to process {{NOT IN}} subquery is to rewrite to a form
> of Anti-join with null-aware property in the Logical Plan and then translate to a form of
> {{OR}} predicate joining the parent side and the subquery side of the {{NOT IN}}. As a result,
> the presence of {{OR}} predicate is limited to the nested-loop join execution plan, which
> will have a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to the technique
> used in the implementation of the Existence join that addresses the problem of {{EXISTS (..)
> OR ..}} type of queries.
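
The null-aware behaviour described in the quoted issue can be illustrated with a minimal
sketch in plain Python. This is not Spark code: the function names are hypothetical, None
stands in for SQL NULL, and the join condition (x = y) OR (x = y) IS NULL is an assumption
about the general shape of the {{OR}} predicate mentioned above. The sketch only shows why
the predicate is not a pure equality, which is what rules out a hash-based join and leaves a
nested loop over both sides.

```python
def sql_eq(a, b):
    """SQL three-valued equality; None models NULL, so the result may be None (UNKNOWN)."""
    if a is None or b is None:
        return None
    return a == b


def not_in_reference(outer, sub):
    """Evaluate `x NOT IN (sub)` row by row under three-valued logic."""
    kept = []
    for x in outer:
        in_result = False  # x IN (sub) is an OR over the equalities x = y
        for y in sub:
            eq = sql_eq(x, y)
            if eq is True:
                in_result = True
                break
            if eq is None:
                in_result = None  # UNKNOWN, unless a later y matches
        # NOT IN keeps the row only when IN is definitely False
        if in_result is False:
            kept.append(x)
    return kept


def null_aware_anti_join(outer, sub):
    """Anti-join on the assumed condition (x = y) OR (x = y) IS NULL.

    The OR makes the condition a non-equi predicate, so every outer row
    is compared against every subquery row (a nested loop).
    """
    kept = []
    for x in outer:
        matched = False
        for y in sub:
            eq = sql_eq(x, y)
            if eq is True or eq is None:
                matched = True
                break
        if not matched:
            kept.append(x)
    return kept
```

Both functions return the same rows for any input, including the cases where either side
contains NULL or the subquery is empty; the cost of the second one grows with the product of
the two input sizes, which matches the performance concern raised in this thread.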

This message was sent by Atlassian Jira
