spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23957) Sorts in subqueries are redundant and can be removed
Date Thu, 12 Apr 2018 04:34:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434918#comment-16434918 ]

Apache Spark commented on SPARK-23957:
--------------------------------------

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/21049

> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
>                 Key: SPARK-23957
>                 URL: https://issues.apache.org/jira/browse/SPARK-23957
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Henry Robinson
>            Priority: Major
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned and optimized subqueries should have any sort operators (since the result of the subquery is an unordered collection of tuples).
> For example:
> {{SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a subquery. SPARK-23375 is related, but removes sorts that are functionally redundant because they perform the same ordering.
>   
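
The kind of rule proposed in the issue can be illustrated with a toy plan tree. The sketch below is an assumption-laden illustration in plain Python, not Spark's Catalyst API and not the implementation in the linked pull request: the Plan class, the operator names, and remove_subquery_sorts are all hypothetical. It drops a Sort that sits inside a subquery unless a Limit above it still depends on the ordering, and it conservatively treats Project, Filter, and Subquery as order-preserving.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical single-child plan node (not a Spark class)."""
    op: str                          # e.g. "Scan", "Project", "Sort", "Limit", "Subquery"
    child: Optional["Plan"] = None
    detail: str = ""


# Assumption: these operators pass their input ordering through unchanged,
# so a Limit above them still depends on a Sort below them.
ORDER_PRESERVING = {"Project", "Filter", "Subquery"}


def remove_subquery_sorts(plan: Optional[Plan],
                          in_subquery: bool = False,
                          under_limit: bool = False) -> Optional[Plan]:
    """Return a copy of the plan without sorts that are redundant inside a subquery."""
    if plan is None:
        return None
    if plan.op == "Sort" and in_subquery and not under_limit:
        # The subquery yields an unordered bag of tuples, so skip this node.
        return remove_subquery_sorts(plan.child, in_subquery, False)
    child_under_limit = plan.op == "Limit" or (under_limit and plan.op in ORDER_PRESERVING)
    new_child = remove_subquery_sorts(plan.child,
                                      in_subquery or plan.op == "Subquery",
                                      child_under_limit)
    return replace(plan, child=new_child)


def ops(plan: Optional[Plan]) -> List[str]:
    """List operator names from the root of the plan down to the leaf."""
    names = []
    while plan is not None:
        names.append(plan.op)
        plan = plan.child
    return names


# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
count_plan = Plan("Aggregate", Plan("Subquery", Plan("Sort", Plan("Scan"))))
print(ops(remove_subquery_sorts(count_plan)))   # ['Aggregate', 'Subquery', 'Scan']

# A subquery of the form (SELECT id FROM dft ORDER BY id LIMIT 10) keeps its sort.
limit_plan = Plan("Subquery", Plan("Limit", Plan("Sort", Plan("Scan"))))
print(ops(remove_subquery_sorts(limit_plan)))   # ['Subquery', 'Limit', 'Sort', 'Scan']
```

A sort at the top level of the query is left alone, because the user-visible result order depends on it. A real optimizer rule would also need to handle multi-child operators such as joins, which this sketch leaves out.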



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


