spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: [SQL] Syntax "case when" doesn't be supported in JOIN
Date Mon, 17 Jul 2017 08:19:41 GMT

I created a draft pull request for explaining the cases:
https://github.com/apache/spark/pull/18652



Chang Chen wrote
> Hi All
> 
> I don't understand the difference between the semantics, I found Spark
> does
> the same thing for GroupBy non-deterministic. From Map-Reduce point of
> view, Join is also GroupBy in essence .
> 
> @Liang Chi Hsieh
> &lt;https://plus.google.com/u/0/103179362592085650735?prsrc=4&gt;
> 
> in which situation,  semantics  will be changed?
> 
> Thanks
> Chang
> 
> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh &lt;

> viirya@

> &gt; wrote:
> 
>>
>> Thinking about it more, I think it changes the semantics only under
>> certain
>> scenarios.
>>
>> For the example SQL query shown in previous discussion, it looks the same
>> semantics.
>>
>>
>> Xiao Li wrote
>> > If the join condition is non-deterministic, pushing it down to the
>> > underlying project will change the semantics. Thus, we are unable to do
>> it
>> > in PullOutNondeterministic. Users can do it manually if they do not
>> care
>> > the semantics difference.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> >
>> >
>> > 2017-07-16 20:07 GMT-07:00 Chang Chen &lt;
>>
>> > baibaichen@
>>
>> > &gt;:
>> >
>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>> >> And
>> >> this workaround is equivalent  to insert a Project between Join
>> operator
>> >> and its child.
>> >>
>> >> Why not do it in PullOutNondeterministic?
>> >>
>> >> Thanks
>> >> Chang
>> >>
>> >>
>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh &lt;
>>
>> > viirya@
>>
>> > &gt; wrote:
>> >>
>> >>>
>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>> projection
>> >>> before the join.
>> >>>
>> >>> SELECT a.col1
>> >>> FROM (
>> >>>   SELECT col1,
>> >>>     CASE
>> >>>          WHEN col2 IS NULL
>> >>>            THEN cast(rand(9)*1000 - 9999999999 as string)
>> >>>          ELSE
>> >>>            col2
>> >>>     END AS col2
>> >>>     FROM tbl1) a
>> >>> LEFT OUTER JOIN tbl2 b
>> >>> ON a.col2 = b.col3;
>> >>>
>> >>>
>> >>>
>> >>> Chang Chen wrote
>> >>> > Hi Wenchen
>> >>> >
>> >>> > Yes. We also find this error is caused by Rand. However, this is
>> >>> classic
>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>> Spark?
>> >>> >
>> >>> > Thanks
>> >>> > Chang
>> >>> >
>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan &lt;
>> >>>
>> >>> > cloud0fan@
>> >>>
>> >>> > &gt; wrote:
>> >>> >
>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>> expressions
>> >>> >> are not allowed in join condition.
>> >>> >>
>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang &lt;
>> >>>
>> >>> > cn_wss@
>> >>>
>> >>> > &gt; wrote:
>> >>> >> >
>> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>> >>> >> thriftserver), For
>> >>> >> > optimizing data skew, we use "case when" to handle null.
>> >>> >> > Simple sql as following:
>> >>> >> >
>> >>> >> >
>> >>> >> > SELECT a.col1
>> >>> >> > FROM tbl1 a
>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>> >> > ON
>> >>> >> > *     CASE
>> >>> >> >               WHEN a.col2 IS NULL
>> >>> >> >                       TNEN cast(rand(9)*1000 - 9999999999
as
>> >>> string)
>> >>> >> >               ELSE
>> >>> >> >                       a.col2 END *
>> >>> >> >       = b.col3;
>> >>> >> >
>> >>> >> >
>> >>> >> > But I get the error:
>> >>> >> >
>> >>> >> > == Physical Plan ==
>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>> >>> expressions
>> >>> >> are
>> >>> >> > only allowed in
>> >>> >> > Project, Filter, Aggregate or Window, found:*
>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9)
*
>> CAST(1000
>> >>> AS
>> >>> >> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE
>> >>> a.`nav_tcdt`
>> >>> >> END
>> >>> >> =
>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9))
AND
>> >>> >> (c.`cur_flag`
>> >>> >> =
>> >>> >> > 1))
>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
>> THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999
as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && (cur_flag#77 = 1))
>> >>> >> >               ;;
>> >>> >> > GlobalLimit 10
>> >>> >> > +- LocalLimit 10
>> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10
as string)
>> IN
>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997
as
>> string))
>> >>> &&
>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
>> >>> END],
>> >>> >> > [date_id#7]
>> >>> >> >      +- Filter (date_id#7 = 2017-07-12)
>> >>> >> >         +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25)
THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999
as
>> >>> double))
>> >>> as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> (cast(nav_tcd#26
>> >>> as
>> >>> >> int)
>> >>> >> > = 9)) && (cur_flag#77 = 1))
>> >>> >> >            :- SubqueryAlias a
>> >>> >> >            :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >>> >> >            :     +- CatalogRelation
>> >>> `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> >>> >> chanl_id#8L,
>> >>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> >>> >> nav_refer_page_type_id#13,
>> >>> >> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> >>> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> >>> >> nav_page_value#20,
>> >>> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> >>> >> nav_tcd#26,
>> >>> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> >>> >> > detl_refer_page_value#30, ... 33 more fields]
>> >>> >> >            +- SubqueryAlias c
>> >>> >> >               +- SubqueryAlias dim_site_categ_ext
>> >>> >> >                  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> >>> >> [site_categ_skid#64L,
>> >>> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> >>> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> >>> >> sort_seq#71L,
>> >>> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74,
>> >>> etl_batch_id#75L,
>> >>> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
>> >>> bkgrnd_categ_id#79L,
>> >>> >> > site_categ_id#80, site_categ_parnt_id#81]
>> >>> >> >
>> >>> >> > Does spark sql not support syntax "case when" in JOIN?
>> Additional,
>> >>> my
>> >>> >> spark
>> >>> >> > version is 2.2.0.
>> >>> >> > Any help would be greatly appreciated.
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > --
>> >>> >> > View this message in context: http://apache-spark-developers
>> >>> >> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
>> >>> >> be-supported-in-JOIN-tp21953.html
>> >>> >> > Sent from the Apache Spark Developers List mailing list
archive
>> at
>> >>> >> Nabble.com.
>> >>> >> >
>> >>> >> > ------------------------------------------------------------
>> >>> ---------
>> >>> >> > To unsubscribe e-mail:
>> >>>
>> >>> > dev-unsubscribe@.apache
>> >>>
>> >>> >> >
>> >>> >>
>> >>> >>
>> >>> >> ------------------------------------------------------------
>> ---------
>> >>> >> To unsubscribe e-mail:
>> >>>
>> >>> > dev-unsubscribe@.apache
>> >>>
>> >>> >>
>> >>> >>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> -----
>> >>> Liang-Chi Hsieh | @viirya
>> >>> Spark Technology Center
>> >>> http://www.spark.tc/
>> >>> --
>> >>> View this message in context: http://apache-spark-developers
>> >>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
>> >>> supported-in-JOIN-tp21953p21961.html
>> >>> Sent from the Apache Spark Developers List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>> >>>
>> >>>
>> >>
>>
>>
>>
>>
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/SQL-Syntax-case-
>> when-doesn-t-be-supported-in-JOIN-tp21953p21973.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21976.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message