spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions
Date Tue, 16 Jul 2019 16:22:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886271#comment-16886271
] 

Sean Owen commented on SPARK-28225:
-----------------------------------

That's a weird one and I agree with what you expect. The 'nulls last' doesn't matter as you
demonstrate so you could exclude it to simplify this one. The answers are right except that
they should always be the same across rows.

I think there's probably some rule that's taking precedence incorrectly that if y is null
then the result of first(y).over(...) must be null but I can't see how that's valid.

> Unexpected behavior for Window functions
> ----------------------------------------
>
>                 Key: SPARK-28225
>                 URL: https://issues.apache.org/jira/browse/SPARK-28225
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Andrew Leverentz
>            Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function with an
ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
>     Row(101, null, 11),
>     Row(102, null, 12),
>     Row(103, null, 13),
>     Row(203, 24, null),
>     Row(201, 26, null),
>     Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, the results
I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message