spark-issues mailing list archives

From "Xiao Li (JIRA)" <>
Subject [jira] [Assigned] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
Date Thu, 01 Nov 2018 19:57:00 GMT


Xiao Li reassigned SPARK-25914:

    Assignee: Dilip Biswal

> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>                 Key: SPARK-25914
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maryann Xue
>            Assignee: Dilip Biswal
>            Priority: Major
> Currently the Spark SQL logical Aggregate has two expression fields: {{groupingExpressions}} and
{{aggregateExpressions}}, in which {{aggregateExpressions}} is actually the result expressions,
or in other words, the project list in the SELECT clause.
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} flattens nested "concat" calls, which causes
the query to fail because the expression {{concat('x', a, 's')}} in the SELECT clause
references neither a grouping expression nor an aggregate expression.
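
The failure can be reproduced outside Spark with a toy model. The sketch below is not Catalyst code: the tuple encoding of expressions and the helper names `flatten_concat` and `is_valid_select` are assumptions made for illustration. It only mimics the flattening described above and a simplified form of the check that every SELECT expression is built from whole grouping expressions.

```python
# Toy expression trees (hypothetical encoding, not Spark's classes):
# a column is a str, a literal is ("lit", value),
# and a function call is ("concat", arg1, arg2, ...).

def flatten_concat(expr):
    """Inline nested concat calls into their parent, bottom-up."""
    if not isinstance(expr, tuple) or expr[0] != "concat":
        return expr
    args = []
    for arg in expr[1:]:
        arg = flatten_concat(arg)
        if isinstance(arg, tuple) and arg[0] == "concat":
            args.extend(arg[1:])  # splice the child's arguments in place
        else:
            args.append(arg)
    return ("concat", *args)

def is_valid_select(expr, grouping):
    """True if expr is built only from literals and whole grouping expressions."""
    if expr in grouping:
        return True
    if isinstance(expr, tuple) and expr[0] == "lit":
        return True
    if isinstance(expr, tuple):
        return all(is_valid_select(arg, grouping) for arg in expr[1:])
    return False  # a bare column that is not itself a grouping expression

# GROUP BY concat(a, 's') and SELECT concat('x', concat(a, 's'))
grouping = [("concat", "a", ("lit", "s"))]
select = ("concat", ("lit", "x"), ("concat", "a", ("lit", "s")))

before = is_valid_select(select, grouping)   # nested concat matches the grouping key
optimized = flatten_concat(select)           # concat('x', a, 's')
after = is_valid_select(optimized, grouping) # the grouping key is no longer a subtree
```

In this model `before` is true and `after` is false: the grouping expression has no nested concat, so flattening leaves it unchanged while the SELECT expression loses the subtree that used to match it.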
>  The problem is that we try to mix two operations in one operator, and worse, in one
field: the group-and-aggregate operation and the project operation. There are two ways to
solve this problem:
>  1. Break the two operations into two logical operators, which means a group-by query
can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, the same
way we do for physical aggregate classes (e.g., {{HashAggregateExec}} or {{SortAggregateExec}}).
Thus, {{groupingExpressions}} would still be the expressions from the GROUP BY clause (as
before), but {{aggregateExpressions}} would contain aggregate functions only, and {{resultExpressions}} would
be the project list in the SELECT clause, holding references to either {{groupingExpressions}} or
{{aggregateExpressions}}.
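
A minimal sketch of what option 2 could look like as a data structure, assuming a toy encoding rather than Spark's actual classes. The field names follow the proposal; the `("ref", kind, index)` encoding, the `dangling_references` helper, and the extra `sum(b)` aggregate are hypothetical and exist only to make the example concrete.

```python
from dataclasses import dataclass

@dataclass
class Aggregate:
    """Toy three-field logical Aggregate (field names follow the proposal)."""
    grouping_expressions: list
    aggregate_expressions: list
    result_expressions: list

    def dangling_references(self):
        """Return references in the result list that point at no existing slot."""
        sizes = {"group": len(self.grouping_expressions),
                 "agg": len(self.aggregate_expressions)}
        dangling = []

        def walk(expr):
            if isinstance(expr, tuple) and expr[0] == "ref":
                _, kind, index = expr
                if not 0 <= index < sizes.get(kind, 0):
                    dangling.append(expr)
            elif isinstance(expr, tuple):
                for arg in expr[1:]:
                    walk(arg)

        for expr in self.result_expressions:
            walk(expr)
        return dangling

# Hypothetical query: SELECT concat('x', concat(a, 's')), sum(b) ... GROUP BY concat(a, 's')
plan = Aggregate(
    grouping_expressions=[("concat", "a", ("lit", "s"))],
    aggregate_expressions=[("sum", "b")],
    result_expressions=[
        ("concat", ("lit", "x"), ("ref", "group", 0)),  # projection over the grouping key
        ("ref", "agg", 0),                              # projection of the aggregate
    ],
)
dangling = plan.dangling_references()
```

Because the result list refers to the other two fields only by position, a rule that rewrites a result expression cannot accidentally change what is being grouped or aggregated.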
>  I would say option 1 is even clearer, but it would be more likely to break the pattern
matching in existing optimization rules and thus require more changes in the compiler. So
we'd probably want to go with option 2. That said, I suggest we achieve this goal through two
iterative steps:
>  Phase 1: Keep the current fields of logical Aggregate as {{groupingExpressions}} and {{aggregateExpressions}},
but change the semantics of {{aggregateExpressions}} by replacing the grouping expressions
with corresponding references to expressions in {{groupingExpressions}}. The aggregate expressions
in {{aggregateExpressions}} will remain the same.
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only aggregate expressions
in {{aggregateExpressions}}.
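
The Phase 1 rewrite can be sketched on the same kind of toy model. This is an illustration under stated assumptions, not the Catalyst implementation: expressions are plain tuples, and `bind_grouping`, `flatten_concat`, and the `("group_ref", i)` node are hypothetical names. The sketch replaces every subtree that equals a grouping expression with a positional reference and then applies a concat-flattening pass to the result.

```python
def bind_grouping(expr, grouping):
    """Replace each subtree equal to a grouping expression with ("group_ref", i)."""
    if expr in grouping:
        return ("group_ref", grouping.index(expr))
    if isinstance(expr, tuple) and expr[0] != "lit":
        return (expr[0], *(bind_grouping(arg, grouping) for arg in expr[1:]))
    return expr  # plain columns and literals are left as they are

def flatten_concat(expr):
    """Inline nested concat calls into their parent, bottom-up."""
    if not isinstance(expr, tuple) or expr[0] != "concat":
        return expr
    args = []
    for arg in expr[1:]:
        arg = flatten_concat(arg)
        if isinstance(arg, tuple) and arg[0] == "concat":
            args.extend(arg[1:])
        else:
            args.append(arg)
    return ("concat", *args)

# GROUP BY concat(a, 's') and SELECT concat('x', concat(a, 's'))
grouping = [("concat", "a", ("lit", "s"))]
select = ("concat", ("lit", "x"), ("concat", "a", ("lit", "s")))

bound = bind_grouping(select, grouping)  # concat('x', <reference to grouping key 0>)
flattened = flatten_concat(bound)        # the reference is opaque, so nothing is inlined
```

Once the grouping key is referenced rather than repeated, the flattening pass has no nested concat to inline, so in this model the SELECT expression stays tied to the grouping expression.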

This message was sent by Atlassian JIRA