Andrew Leverentz created SPARK-28225:
----------------------------------------
Summary: Unexpected behavior for Window functions
Key: SPARK-28225
URL: https://issues.apache.org/jira/browse/SPARK-28225
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Andrew Leverentz
I've noticed some odd behavior when combining the "first" aggregate function with an ordered
Window.
In particular, I'm working with columns created using the syntax
{code}
first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
{code}
Below, I'm including some code which reproduces this issue in a Databricks notebook.
*Code:*
{code:java}
import org.apache.spark.sql.functions.first
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
val schema = StructType(Seq(
StructField("x", IntegerType, false),
StructField("y", IntegerType, true),
StructField("z", IntegerType, true)
))
val input =
spark.createDataFrame(sc.parallelize(Seq(
Row(101, null, 11),
Row(102, null, 12),
Row(103, null, 13),
Row(203, 24, null),
Row(201, 26, null),
Row(202, 25, null)
)), schema = schema)
input.show
val output = input
.withColumn("u1", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
.withColumn("u2", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc)))
.withColumn("u3", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
.withColumn("u4", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc)))
.withColumn("u5", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
.withColumn("u6", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc)))
.withColumn("u7", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
.withColumn("u8", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc)))
output.show
{code}
*Expectation:*
Based on my understanding of how ordered-Window and aggregate functions work, the results
I expected to see were:
* u1 = u2 = constant value of 26
* u3 = u4 = constant value of 24
* u5 = u6 = constant value of 11
* u7 = u8 = constant value of 13
However, columns u1, u2, u7, and u8 contain some unexpected nulls.
*Results:*
{code:java}
+---+----+----+----+----+---+---+---+---+----+----+
| x| y| z| u1| u2| u3| u4| u5| u6| u7| u8|
+---+----+----+----+----+---+---+---+---+----+----+
|203| 24|null| 26| 26| 24| 24| 11| 11|null|null|
|202| 25|null| 26| 26| 24| 24| 11| 11|null|null|
|201| 26|null| 26| 26| 24| 24| 11| 11|null|null|
|103|null| 13|null|null| 24| 24| 11| 11| 13| 13|
|102|null| 12|null|null| 24| 24| 11| 11| 13| 13|
|101|null| 11|null|null| 24| 24| 11| 11| 13| 13|
+---+----+----+----+----+---+---+---+---+----+----+
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
|