spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: [SQL] Syntax "case when" doesn't be supported in JOIN
Date Fri, 14 Jul 2017 09:29:36 GMT

A possible workaround is to add the rand column into tbl1 with a projection
before the join.

SELECT a.col1
FROM (
  SELECT col1,
    CASE
         WHEN col2 IS NULL
           THEN cast(rand(9)*1000 - 9999999999 as string)
         ELSE
           col2
    END AS col2
    FROM tbl1) a
LEFT OUTER JOIN tbl2 b
ON a.col2 = b.col3;



Chang Chen wrote
> Hi Wenchen
> 
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
> 
> Thanks
> Chang
> 
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan &lt;

> cloud0fan@

> &gt; wrote:
> 
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang &lt;

> cn_wss@

> &gt; wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > *     CASE
>> >               WHEN a.col2 IS NULL
>> >                       TNEN cast(rand(9)*1000 - 9999999999 as string)
>> >               ELSE
>> >                       a.col2 END *
>> >       = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END
>> =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag`
>> =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >               ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >      +- Filter (date_id#7 = 2017-07-12)
>> >         +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26
as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >            :- SubqueryAlias a
>> >            :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >            :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >            +- SubqueryAlias c
>> >               +- SubqueryAlias dim_site_categ_ext
>> >                  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
>> be-supported-in-JOIN-tp21953.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21961.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message