spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Jin <ice.xell...@gmail.com>
Subject Clarify window behavior in Spark SQL
Date Tue, 03 Apr 2018 20:26:32 GMT
Hi Devs,

I am seeing some behavior with window functions that is a bit unintuitive
and would like to get some clarification.

When using aggregation function with window, the frame boundary seems to
change depending on the order of the window.

Example:
(1)

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w1 = Window.partitionBy('id')

df.withColumn('v2', mean(df.v).over(w1)).show()

+---+---+---+

| id|  v| v2|

+---+---+---+

|  0|  1|2.0|

|  0|  2|2.0|

|  0|  3|2.0|

+---+---+---+

(2)
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w2 = Window.partitionBy('id').orderBy('v')

df.withColumn('v2', mean(df.v).over(w2)).show()

+---+---+---+

| id|  v| v2|

+---+---+---+

|  0|  1|1.0|

|  0|  2|1.5|

|  0|  3|2.0|

+---+---+---+

Seems like orderBy('v') in the example (2) also changes the frame
boundaries from (

unboundedPreceding, unboundedFollowing) to (unboundedPreceding, currentRow).


I found this behavior a bit unintuitive. I wonder if this behavior is by
design and if so, what's the specific rule that orderBy() interacts with
frame boundaries?


Thanks,

Li

Mime
View raw message