spark-issues mailing list archives

From "Wenchen Fan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
Date Fri, 20 May 2016 07:45:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292935#comment-15292935 ]

Wenchen Fan commented on SPARK-15441:
-------------------------------------

The problem is that we use `CreateStruct` to construct a Row for each join side so that
encoders can work on it. However, for an outer join, our join implementation nulls out every
column of one join side when the join condition is not satisfied. `CreateStruct` then sees a
bunch of null columns and creates a row with null fields, while we expect to get a null row
here.

The tricky part is, the reason why these columns are null is important to encoders. If they
are null because the join condition is not satisfied, encoders expect a null row. If they
are null because the record really contains null for these columns, encoders expect a row
with null fields.
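
To make the ambiguity concrete, here is a minimal plain-Scala sketch. It is not Spark code:
`Side` and `createStruct` are hypothetical stand-ins for one join side and for the
struct-building step, used only to show that both cases look identical once the struct is built.

```scala
// Plain-Scala model of the ambiguity; none of these names are Spark APIs.
object NullRowAmbiguity {
  // One join side with two nullable columns (hypothetical shape).
  final case class Side(k: String, v: java.lang.Integer)

  // Stand-in for struct creation: it always returns a row object,
  // even when every input column is null.
  def createStruct(k: String, v: java.lang.Integer): Side = Side(k, v)

  def main(args: Array[String]): Unit = {
    // Case 1: the join condition was not satisfied, so the join nulled out both columns.
    val unmatched = createStruct(null, null)
    // Case 2: the record matched, but really contains null in both columns.
    val matchedAllNull = createStruct(null, null)

    // After struct creation the two cases cannot be told apart.
    println(unmatched == matchedAllNull) // true
    // Neither is a null row, although case 1 should have produced one.
    println(unmatched == null) // false
  }
}
```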

I can't think of a simple fix for this. A possible solution is to extend our join implementation
to carry along the reason why all columns of one join side are null.
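
As a rough illustration of that idea only (not a proposed patch), the sketch below assumes the
join could emit a per-side `matched` flag next to the columns. `JoinedSide`, `matched`, and
`decode` are hypothetical names, and `None` stands in for the null row.

```scala
// Plain-Scala sketch of carrying the "reason" with each join side; not Spark code.
object JoinSideWithReason {
  // The row shape the encoder is expected to produce (hypothetical).
  final case class Side(k: String, v: java.lang.Integer)

  // Hypothetical join output for one side: the columns plus a flag
  // recording whether the join condition was satisfied.
  final case class JoinedSide(matched: Boolean, k: String, v: java.lang.Integer)

  // Hypothetical decoding step: None (the "null row") when the side did not match,
  // otherwise a row whose fields may themselves be null.
  def decode(js: JoinedSide): Option[Side] =
    if (js.matched) Some(Side(js.k, js.v)) else None

  def main(args: Array[String]): Unit = {
    // Unmatched side: all columns null because the join nulled them out.
    println(decode(JoinedSide(matched = false, null, null)))
    // Matched side whose record really holds nulls: still a row, with null fields.
    println(decode(JoinedSide(matched = true, null, null)))
  }
}
```

With the flag available, the two all-null cases decode differently, which is the distinction
the encoders need.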

> dataset outer join seems to return incorrect result
> ---------------------------------------------------
>
>                 Key: SPARK-15441
>                 URL: https://issues.apache.org/jira/browse/SPARK-15441
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
>     functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}


