spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ray <ray.and...@gmail.com>
Subject Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.
Date Tue, 04 Apr 2017 01:23:37 GMT
You probably don't want null safe equals (<=>) with a left join.

On Mon, Apr 3, 2017 at 5:46 PM gjohnson35 <gjohnson@artemishealth.com>
wrote:

> The join condition with && is throwing an exception:
>
>  val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=>
> baseDF("medical_claim_id")
>       && mccDF("medical_claim_detail_id") <=>
> baseDF("medical_claim_detail_id"), "left")
>       .join(revCdDF, revCdDF("revenue_code_padded_str") <=>
> mccDF("mcc_code"), "left")
>       .select(baseDF("medical_claim_id"),
> baseDF("medical_claim_detail_id"),
> baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
>         baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"),
> baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
>         revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
>       .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
>       .groupBy(
>         baseDF("medical_claim_id"),
>         baseDF("medical_claim_detail_id")
>       )
>       .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")),
> min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))
>
>
> This query runs fine:
>
> val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=>
> baseDF("medical_claim_id"), "left")
>         .join(mccDF, mccDF("medical_claim_detail_id") <=>
> baseDF("medical_claim_detail_id"), "left")
>       .join(revCdDF, revCdDF("revenue_code_padded_str") <=>
> mccDF("mcc_code"), "left")
>       .select(baseDF("medical_claim_id"),
> baseDF("medical_claim_detail_id"),
> baseDF("revenue_code"), baseDF("rev_code_distinct_count"),
>         baseDF("rtos_1_1_count"), baseDF("rtos_1_0_count"),
> baseDF("er_visit_flag"), baseDF("observation_stay_flag"),
>         revCdDF("rtos_2_code"), revCdDF("rtos_2_hierarchy"))
>       .where(revCdDF("rtos_2_code").between(8, 27).isNotNull)
>       .groupBy(
>         baseDF("medical_claim_id"),
>         baseDF("medical_claim_detail_id")
>       )
>       .agg(min(revCdDF("rtos_2_code").alias("min_rtos_2_8_thru_27")),
> min(revCdDF("rtos_2_hierarchy").alias("min_rtos_2_8_thru_27_hier")))
>
> If I remove the multiple Columns in the join and create a join statement
> for
> each one then the exception goes away.  Is there a better way to join
> multiple columns?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-left-join-with-multiple-columns-Join-condition-is-missing-or-trivial-Use-the-CROSS-JOIN-syntax-tp21297.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Mime
View raw message