spark-issues mailing list archives

From "Wenchen Fan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
Date Sat, 21 May 2016 15:08:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295066#comment-15295066 ]

Wenchen Fan commented on SPARK-15441:
-------------------------------------

I don't think we can always convert a row whose columns are all null into null. Say we
have a `case class Person(name: String, email: String)`: Person(null, null) is different from
null. So we need to know why all of a row's columns are null. If it's because the row
really has null values in those columns, we should produce Person(null, null); if it's because
it's an outer join and the join condition was not satisfied, we should produce null.
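
To make the distinction concrete, here is a minimal sketch in plain Scala with no Spark
dependency. It is not how Spark implements joinWith; `NullRowSketch` and `rightOuterJoin`
are hypothetical names used only for illustration, and the sample data is made up.

```scala
// Illustrative sketch in plain Scala (no Spark dependency); not Spark's implementation.
case class Person(name: String, email: String)

object NullRowSketch {
  // Hypothetical in-memory right outer join keyed on Person.name.
  // An unmatched right row is paired with null, never with Person(null, null).
  def rightOuterJoin(
      left: List[Person],
      right: List[(String, String)]): List[(Person, (String, String))] =
    right.flatMap { r =>
      val matches = left.filter(p => p.name == r._1)
      if (matches.isEmpty) {
        val noMatch: Person = null // join condition not satisfied
        List((noMatch, r))
      } else {
        matches.map(p => (p, r))
      }
    }

  def main(args: Array[String]): Unit = {
    // The second left row really holds null in every column.
    val left = List(Person("a", "a@example.com"), Person(null, null))
    val right = List(("a", "x"), ("d", "z"))
    rightOuterJoin(left, right).foreach(println)
    println(Person(null, null) == null)
  }
}
```

In this sketch the Person(null, null) row on the left side is ordinary data, while the
null paired with ("d", "z") only records that no left row matched.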

> dataset outer join seems to return incorrect result
> ---------------------------------------------------
>
>                 Key: SPARK-15441
>                 URL: https://issues.apache.org/jira/browse/SPARK-15441
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>            Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
>     functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
