spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: Clarify window behavior in Spark SQL
Date Tue, 03 Apr 2018 21:09:49 GMT
Seems like a bug.



On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:

> Hi Devs,
>
> I am seeing some behavior with window functions that is a bit unintuitive
> and would like to get some clarification.
>
> When using aggregation function with window, the frame boundary seems to
> change depending on the order of the window.
>
> Example:
> (1)
>
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w1 = Window.partitionBy('id')
>
> df.withColumn('v2', mean(df.v).over(w1)).show()
>
> +---+---+---+
>
> | id|  v| v2|
>
> +---+---+---+
>
> |  0|  1|2.0|
>
> |  0|  2|2.0|
>
> |  0|  3|2.0|
>
> +---+---+---+
>
> (2)
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w2 = Window.partitionBy('id').orderBy('v')
>
> df.withColumn('v2', mean(df.v).over(w2)).show()
>
> +---+---+---+
>
> | id|  v| v2|
>
> +---+---+---+
>
> |  0|  1|1.0|
>
> |  0|  2|1.5|
>
> |  0|  3|2.0|
>
> +---+---+---+
>
> Seems like orderBy('v') in the example (2) also changes the frame
> boundaries from (
>
> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
> currentRow).
>
>
> I found this behavior a bit unintuitive. I wonder if this behavior is by
> design and if so, what's the specific rule that orderBy() interacts with
> frame boundaries?
>
>
> Thanks,
>
> Li
>
>

Mime
View raw message