spark-issues mailing list archives

From "Maryann Xue (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
Date Thu, 01 Nov 2018 19:44:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated SPARK-25914:
--------------------------------
    Description: 
Currently the Spark SQL logical Aggregate has two expression fields: {{groupingExpressions}} and
{{aggregateExpressions}}. Despite its name, {{aggregateExpressions}} actually holds the result expressions,
in other words the project list of the SELECT clause.
  
 This causes an exception when processing the following query:
{code:sql}
SELECT concat('x', concat(a, 's'))
FROM testData2
GROUP BY concat(a, 's'){code}
 After optimization, the query becomes:
{code:sql}
SELECT concat('x', a, 's')
FROM testData2
GROUP BY concat(a, 's'){code}
The optimization rule {{CombineConcats}} flattens nested {{concat}} calls, which causes the query
to fail: the expression {{concat('x', a, 's')}} in the SELECT clause now references neither
a grouping expression nor an aggregate expression.
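
To make the failure concrete, here is a minimal Python sketch of the situation. It is a toy model, not Spark code: expressions are nested tuples, {{flatten_concat}} is a hypothetical stand-in for what {{CombineConcats}} does, and {{is_valid_select}} is an assumed simplification of the check that every column in the SELECT list sits inside a grouping expression (aggregate functions are left out entirely).

```python
# Toy model (assumption, not Spark's classes): a column is a str,
# a literal is ("lit", value), a call is ("concat", arg1, arg2, ...).

def flatten_concat(expr):
    """Hypothetical stand-in for CombineConcats: inline nested concat calls."""
    if not isinstance(expr, tuple) or expr[0] != "concat":
        return expr
    args = []
    for arg in map(flatten_concat, expr[1:]):
        if isinstance(arg, tuple) and arg[0] == "concat":
            args.extend(arg[1:])  # splice the child's arguments into the parent
        else:
            args.append(arg)
    return ("concat", *args)


def is_valid_select(expr, grouping):
    """Assumed check: every column must appear inside a grouping expression."""
    if expr in grouping:
        return True   # whole subtree matches a GROUP BY expression
    if isinstance(expr, str):
        return False  # bare column outside any grouping expression
    if expr[0] == "lit":
        return True   # literals are always allowed
    return all(is_valid_select(arg, grouping) for arg in expr[1:])


# GROUP BY concat(a, 's')
grouping = [("concat", "a", ("lit", "s"))]
# SELECT concat('x', concat(a, 's'))
select = ("concat", ("lit", "x"), ("concat", "a", ("lit", "s")))
# SELECT concat('x', a, 's') after flattening
optimized = flatten_concat(select)
```

In this model {{is_valid_select(select, grouping)}} holds, while {{is_valid_select(optimized, grouping)}} does not: flattening removes the subtree that matched the grouping expression and leaves a bare reference to {{a}}.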
  
 The problem is that we try to mix two operations in one operator, and worse, in one field:
the group-and-aggregate operation and the project operation. There are two ways to solve this
problem:
 1. Break the two operations into two logical operators, which means a group-by query can
usually be mapped into a Project-over-Aggregate pattern.
 2. Break the two operations into multiple fields in the Aggregate operator, the same way
we do for physical aggregate classes (e.g., {{HashAggregateExec}} or {{SortAggregateExec}}).
{{groupingExpressions}} would still hold the expressions from the GROUP BY clause (as
before), but {{aggregateExpressions}} would contain aggregate functions only, and {{resultExpressions}} would
be the project list of the SELECT clause, holding references to either {{groupingExpressions}} or
{{aggregateExpressions}}.
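
The three-field split in option 2 can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the Spark operator: the {{Aggregate}} dataclass, its {{execute}} method, the use of plain callables as expressions, the extra {{count(*)}} column, and the sample rows are all hypothetical and only serve to show how result expressions would refer to grouping and aggregate values by position.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Hypothetical model of a logical Aggregate with three separate fields."""
    grouping_expressions: list   # each: row -> one component of the group key
    aggregate_expressions: list  # each: list of rows in a group -> value
    result_expressions: list     # each: (group_key, agg_values) -> output column

    def execute(self, rows):
        # Step 1: group rows by the values of the grouping expressions.
        groups = defaultdict(list)
        for row in rows:
            key = tuple(g(row) for g in self.grouping_expressions)
            groups[key].append(row)
        # Step 2: compute aggregates per group, then project the result list.
        out = []
        for key, members in groups.items():
            aggs = tuple(f(members) for f in self.aggregate_expressions)
            out.append(tuple(r(key, aggs) for r in self.result_expressions))
        return out


# Hypothetical query:
#   SELECT concat('x', concat(a, 's')), count(*) FROM t GROUP BY concat(a, 's')
plan = Aggregate(
    grouping_expressions=[lambda row: row["a"] + "s"],  # concat(a, 's')
    aggregate_expressions=[len],                         # count(*)
    result_expressions=[
        lambda key, aggs: "x" + key[0],  # refers to grouping expression 0
        lambda key, aggs: aggs[0],       # refers to aggregate expression 0
    ],
)

rows = [{"a": "1"}, {"a": "2"}, {"a": "1"}]  # made-up sample data
result = plan.execute(rows)
```

With the sample rows above, {{result}} is {{[("x1s", 2), ("x2s", 1)]}}. The point of the design is that the result expressions never look at input columns directly, so rewriting them cannot detach them from the GROUP BY expressions.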
  
 Option 1 is arguably the clearer design, but it is more likely to break the pattern matching
in existing optimization rules and thus require more changes in the compiler, so we probably
want to go with option 2. I suggest we get there in two incremental phases:
  
 Phase 1: Keep the current fields of logical Aggregate, {{groupingExpressions}} and {{aggregateExpressions}},
but change the semantics of {{aggregateExpressions}}: replace each occurrence of a grouping expression
with a reference to the corresponding expression in {{groupingExpressions}}. The aggregate expressions
in {{aggregateExpressions}} remain unchanged.
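
As a rough illustration of Phase 1, the sketch below rewrites a SELECT expression so that any subtree equal to a grouping expression becomes a positional reference. It is a toy Python model, not Spark code: expressions are nested tuples, {{("groupref", i)}} is a hypothetical marker for a reference to entry {{i}} of {{groupingExpressions}}, and {{flatten_concat}} is an assumed stand-in for {{CombineConcats}}.

```python
# Toy model (assumption, not Spark's classes): a column is a str,
# a literal is ("lit", value), a call is ("concat", arg1, arg2, ...).

def to_group_refs(expr, grouping):
    """Replace each subtree equal to grouping[i] with ("groupref", i)."""
    if expr in grouping:
        return ("groupref", grouping.index(expr))
    if isinstance(expr, tuple) and expr[0] == "concat":
        return ("concat", *(to_group_refs(arg, grouping) for arg in expr[1:]))
    return expr


def flatten_concat(expr):
    """Hypothetical stand-in for CombineConcats: inline nested concat calls."""
    if not isinstance(expr, tuple) or expr[0] != "concat":
        return expr
    args = []
    for arg in map(flatten_concat, expr[1:]):
        if isinstance(arg, tuple) and arg[0] == "concat":
            args.extend(arg[1:])  # splice the child's arguments into the parent
        else:
            args.append(arg)
    return ("concat", *args)


# GROUP BY concat(a, 's')
grouping = [("concat", "a", ("lit", "s"))]
# SELECT concat('x', concat(a, 's'))
select = ("concat", ("lit", "x"), ("concat", "a", ("lit", "s")))

rewritten = to_group_refs(select, grouping)
after_optimization = flatten_concat(rewritten)
```

Here {{rewritten}} is {{("concat", ("lit", "x"), ("groupref", 0))}}, and flattening leaves it unchanged because the inner {{concat}} is no longer visible to the rule; the link to the grouping expression survives optimization.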
  
 Phase 2: Add {{resultExpressions}} for the project list, and keep only aggregate expressions
in {{aggregateExpressions}}.
  



> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>
>                 Key: SPARK-25914
>                 URL: https://issues.apache.org/jira/browse/SPARK-25914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maryann Xue
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

