spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marco Gaido (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions
Date Sat, 20 Jul 2019 14:44:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889518#comment-16889518
] 

Marco Gaido edited comment on SPARK-28225 at 7/20/19 2:43 PM:
--------------------------------------------------------------

Let me cite PostgreSQL documentation to explain you the behavior:

??When an aggregate function is used as a window function, it aggregates over the rows within
the current row's window frame. An aggregate used with ORDER BY and the default window frame
definition produces a "running sum" type of behavior, which may or may not be what's wanted.
To obtain aggregation over the whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other
effects.??

So the returned values seem correct to me.


was (Author: mgaido):
Let me cite PostgreSQL documentation to explain you the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the rows within
the current row's window frame. An aggregate used with ORDER BY and the default window frame
definition produces a "running sum" type of behavior, which may or may not be what's wanted.
To obtain aggregation over the whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other
effects.
{noformat}

So the returned values seem correct to me.


> Unexpected behavior for Window functions
> ----------------------------------------
>
>                 Key: SPARK-28225
>                 URL: https://issues.apache.org/jira/browse/SPARK-28225
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Andrew Leverentz
>            Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function with an
ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
>     Row(101, null, 11),
>     Row(102, null, 12),
>     Row(103, null, 13),
>     Row(203, 24, null),
>     Row(201, 26, null),
>     Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, the results
I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message