spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19503) Dumb Execution Plan
Date Tue, 07 Feb 2017 22:29:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856931#comment-15856931 ]

Sean Owen commented on SPARK-19503:
-----------------------------------

Can you improve the JIRA? The title is uninformative. See http://spark.apache.org/contributing.html

> Dumb Execution Plan
> -------------------
>
>                 Key: SPARK-19503
>                 URL: https://issues.apache.org/jira/browse/SPARK-19503
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0
>         Environment: Perhaps only a pyspark or databricks AWS issue
>            Reporter: R
>            Priority: Minor
>              Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort, and only then the count. This is wasteful, since the sort
> is not required here, and it makes me wonder how smart the algebraic optimiser really is.
> The data may already be partitioned with known counts (such as Parquet files), and we
> should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder what else
> it is missing, especially in more complex operations.
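
To make the requested rewrite concrete, here is a minimal sketch of a rule that drops a sort feeding a count, since row order cannot change the number of rows. This is a toy plan representation in plain Python, not Spark's Catalyst optimizer; the names Scan, Sort, Count, strip_sorts and optimize are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical, minimal logical-plan nodes (not Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Sort:
    child: "Plan"
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class Count:
    child: "Plan"


Plan = Union[Scan, Sort, Count]


def strip_sorts(plan: Plan) -> Plan:
    """Peel off Sort nodes at the top of a plan; ordering does not affect a row count."""
    while isinstance(plan, Sort):
        plan = plan.child
    return plan


def optimize(plan: Plan) -> Plan:
    """Rewrite Count(Sort(x)) to Count(x); leave every other plan unchanged."""
    if isinstance(plan, Count):
        return Count(strip_sorts(optimize(plan.child)))
    if isinstance(plan, Sort):
        # A sort that is not consumed by a count must be preserved.
        return Sort(optimize(plan.child), plan.keys)
    return plan


if __name__ == "__main__":
    before = Count(Sort(Scan("events"), ("ts",)))
    print(before)
    print(optimize(before))
```

In PySpark itself, count() returns a number rather than a DataFrame, so one way to look at the physical plan for this case is to call explain() on something like df.sort("x").groupBy().count() and check whether a sort or exchange step still appears.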



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

