spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maryann Xue (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
Date Thu, 01 Nov 2018 19:33:00 GMT
Maryann Xue created SPARK-25914:
-----------------------------------

             Summary: Separate projection from grouping and aggregate in logical Aggregate
                 Key: SPARK-25914
                 URL: https://issues.apache.org/jira/browse/SPARK-25914
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maryann Xue


Currently the Spark SQL logical Aggregate has two expression fields: {{groupingExpressions}} and
{{aggregateExpressions}}, in which {{aggregateExpressions}} is actually the result expressions,
or in other words, the project list in the SELECT clause.
 
This would cause an exception while processing the following query:
 
  SELECT concat('x', concat(a, 's'))
  FROM testData2
  GROUP BY concat(a, 's')
 
After optimization, the query becomes:
 
  SELECT concat('x', a, 's')
  FROM testData2
  GROUP BY concat(a, 's')
 
The optimization rule {{CombineConcats}} optimizes the expressions by flattening "concat"
and causes the query to fail since the expression {{concat('x', a, 's')}} in the SELECT clause
is neither referencing a grouping expression nor a aggregate expression.
 
The problem is that we try to mix two operations in one operator, and worse, in one field:
the group-and-aggregate operation and the project operation. There are two ways to solve this
problem:
1. Break the two operations into two logical operators, which means a group-by query can usually
to be mapped into a Project-over-Aggregate pattern.
2. Break the two operations into multiple fields in the Aggregate operator, the same way we
do for physical aggregate classes (e.g., {{HashAggregateExec}}, or {{SortAggregateExec}}).
Thus, {{groupingExpressions}} would still be the expressions from the GROUP BY clause (as
before), but {{aggregateExpressions}} would contain aggregate functions only, and {{resultExpressions}} would
be the project list in the SELECT clause holding references to either {{groupingExpressions}} or
{{aggregateExpressions}}.
 
I would say option 1 is even clearer, but it would be more likely to break the pattern matching
in existing optimization rules and thus require more changes in the compiler. So we'd probably
wanna go with option 2. That said, I suggest we achieve this goal through two iterative steps:
 
Phase 1: Keep the current fields of logical Aggregate as {{groupingExpressions}} and {{aggregateExpressions}},
but change the semantics of {{aggregateExpressions}} by replacing the grouping expressions
with corresponding references to expressions in {{groupingExpressions}}. The aggregate expressions
in  {{aggregateExpressions}} will remain the same.
 
Phase 2: Add {{resultExpressions}} for the project list, and keep only aggregate expressions
in {{aggregateExpressions}}.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message