Date: Thu, 1 Nov 2018 19:44:00 +0000 (UTC)
From: "Maryann Xue (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

     [ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated SPARK-25914:
--------------------------------
    Description: 
Currently the Spark SQL logical Aggregate has two expression fields: {{groupingExpressions}} and {{aggregateExpressions}}, in which {{aggregateExpressions}} is actually the result expressions, or in other words, the project list in the SELECT clause.

This would cause an exception while processing the following query:
{code:java}
SELECT concat('x', concat(a, 's'))
FROM testData2
GROUP BY concat(a, 's'){code}
After optimization, the query becomes:
{code:java}
SELECT concat('x', a, 's')
FROM testData2
GROUP BY concat(a, 's'){code}
The optimization rule {{CombineConcats}} optimizes the expressions by flattening "concat", and causes the query to fail since the expression {{concat('x', a, 's')}} in the SELECT clause references neither a grouping expression nor an aggregate expression.
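To make the failure mode concrete, here is a toy flattening pass in the spirit of {{CombineConcats}}, written against a hypothetical miniature expression tree rather than Spark's actual Expression classes (all names here are illustrative, not Spark's API):

```scala
// Minimal stand-ins for an expression tree (hypothetical, not Spark's classes).
sealed trait Expr
case class Attr(name: String) extends Expr           // column reference, e.g. a
case class Lit(value: String) extends Expr           // string literal, e.g. 'x'
case class Concat(children: Seq[Expr]) extends Expr  // concat(...)

// Flatten nested Concat nodes, the way CombineConcats rewrites
// concat('x', concat(a, 's')) into concat('x', a, 's').
def flattenConcat(e: Expr): Expr = e match {
  case Concat(children) =>
    Concat(children.flatMap { child =>
      flattenConcat(child) match {
        case Concat(grandChildren) => grandChildren // splice nested concat inline
        case other                 => Seq(other)
      }
    })
  case other => other
}

val before = Concat(Seq(Lit("x"), Concat(Seq(Attr("a"), Lit("s")))))
val after  = flattenConcat(before)
// after == Concat(Seq(Lit("x"), Attr("a"), Lit("s")))
```

After flattening, the grouping expression {{concat(a, 's')}} no longer appears as a subtree of the SELECT expression, which is exactly why the check that every result expression references a grouping or aggregate expression then fails.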
The problem is that we try to mix two operations in one operator, and worse, in one field: the group-and-aggregate operation and the project operation. There are two ways to solve this problem:

1. Break the two operations into two logical operators, which means a group-by query can usually be mapped into a Project-over-Aggregate pattern.

2. Break the two operations into multiple fields in the Aggregate operator, the same way we do for physical aggregate classes (e.g., {{HashAggregateExec}} or {{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} would contain aggregate functions only, and {{resultExpressions}} would be the project list in the SELECT clause, holding references to either {{groupingExpressions}} or {{aggregateExpressions}}.

I would say option 1 is even clearer, but it would be more likely to break the pattern matching in existing optimization rules and thus require more changes in the compiler. So we should probably go with option 2. That said, I suggest we achieve this goal in two iterative steps:

Phase 1: Keep the current fields of logical Aggregate, {{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics of {{aggregateExpressions}} by replacing the grouping expressions with corresponding references to expressions in {{groupingExpressions}}. The aggregate expressions in {{aggregateExpressions}} will remain the same.

Phase 2: Add {{resultExpressions}} for the project list, and keep only aggregate expressions in {{aggregateExpressions}}.
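As a sketch of where option 2 (the Phase 2 shape) ends up, here is a much-simplified, hypothetical model of the proposed three-field Aggregate; none of these classes are Spark's real ones, and the slot-reference encoding ({{GroupRef}}/{{AggRef}}) is only one possible way to represent the back-references:

```scala
// Hypothetical miniature expression tree (not Spark's actual classes).
sealed trait Expr
case class Attr(name: String) extends Expr                   // column reference
case class Lit(value: String) extends Expr                   // string literal
case class Func(name: String, args: Seq[Expr]) extends Expr  // e.g. concat(...)
case class GroupRef(ordinal: Int) extends Expr               // slot in groupingExpressions
case class AggRef(ordinal: Int) extends Expr                 // slot in aggregateExpressions

// Phase 2 shape: grouping, aggregate functions, and the SELECT-clause
// project list live in three separate fields.
case class Aggregate(
    groupingExpressions: Seq[Expr],  // GROUP BY clause, as before
    aggregateExpressions: Seq[Expr], // aggregate functions only
    resultExpressions: Seq[Expr])    // project list, referencing the two above

// SELECT concat('x', concat(a, 's')) FROM testData2 GROUP BY concat(a, 's')
// The SELECT clause refers to the grouping expression by slot, so a rule
// like CombineConcats can rewrite the grouping expression's internals
// without breaking the project list's link to it.
val agg = Aggregate(
  groupingExpressions  = Seq(Func("concat", Seq(Attr("a"), Lit("s")))),
  aggregateExpressions = Seq.empty,
  resultExpressions    = Seq(Func("concat", Seq(Lit("x"), GroupRef(0)))))
```

With this shape, the failing check from the description becomes trivial: a result expression is valid as long as its leaves are {{GroupRef}}/{{AggRef}} slots (or literals), regardless of how the grouping expressions themselves are later rewritten.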
> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>
>                 Key: SPARK-25914
>                 URL: https://issues.apache.org/jira/browse/SPARK-25914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maryann Xue
>            Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)