If the join condition is non-deterministic, pushing it down to the underlying project will change the semantics. Thus, we are unable to do it in PullOutNondeterministic. Users can do it manually if they do not care the semantics difference. 

Thanks,

Xiao



2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@gmail.com>:
It is tedious since we have lots of Hive SQL being migrated to Spark.  And this workaround is equivalent  to insert a Project between Join operator and its child.

Why not do it in PullOutNondeterministic?

Thanks
Chang


On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@gmail.com> wrote:

A possible workaround is to add the rand column into tbl1 with a projection
before the join.

SELECT a.col1
FROM (
  SELECT col1,
    CASE
         WHEN col2 IS NULL
           THEN cast(rand(9)*1000 - 9999999999 as string)
         ELSE
           col2
    END AS col2
    FROM tbl1) a
LEFT OUTER JOIN tbl2 b
ON a.col2 = b.col3;



Chang Chen wrote
> Hi Wenchen
>
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
>
> Thanks
> Chang
>
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan &lt;

> cloud0fan@

> &gt; wrote:
>
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang &lt;

> cn_wss@

> &gt; wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > *     CASE
>> >               WHEN a.col2 IS NULL
>> >                       TNEN cast(rand(9)*1000 - 9999999999 as string)
>> >               ELSE
>> >                       a.col2 END *
>> >       = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END
>> =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag`
>> =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >               ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >      +- Filter (date_id#7 = 2017-07-12)
>> >         +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >            :- SubqueryAlias a
>> >            :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >            :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >            +- SubqueryAlias c
>> >               +- SubqueryAlias dim_site_categ_ext
>> >                  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
>> be-supported-in-JOIN-tp21953.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail:

> dev-unsubscribe@.apache

>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail:

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21961.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org