spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xingbo Jiang <jiangxb1...@gmail.com>
Subject Re: Clarify window behavior in Spark SQL
Date Wed, 04 Apr 2018 00:50:29 GMT
This is actually by design, without a `ORDER BY` clause, all rows are
considered as the peer row of the current row, which means that the frame
is effectively the entire partition. This behavior follows the window
syntax of PGSQL.
You can refer to the comment by yhuai:
https://github.com/apache/spark/pull/5604#discussion_r157931911
:)

2018-04-04 6:27 GMT+08:00 Reynold Xin <rxin@databricks.com>:

> Do other (non-Hive) SQL systems do the same thing?
>
> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
> herman@databricks.com> wrote:
>
>> This is something we inherited from Hive: https://cwiki.apache.org
>> /confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>
>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>>> ROW.
>>
>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>>> FOLLOWING.
>>
>>
>> It sort of makes sense if you think about it. If there is no ordering
>> there is no way to have a bound frame. If there is ordering we default to
>> the most commonly used deterministic frame.
>>
>>
>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Seems like a bug.
>>>
>>>
>>>
>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:
>>>
>>>> Hi Devs,
>>>>
>>>> I am seeing some behavior with window functions that is a bit
>>>> unintuitive and would like to get some clarification.
>>>>
>>>> When using aggregation function with window, the frame boundary seems
>>>> to change depending on the order of the window.
>>>>
>>>> Example:
>>>> (1)
>>>>
>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>
>>>> w1 = Window.partitionBy('id')
>>>>
>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>
>>>> +---+---+---+
>>>>
>>>> | id|  v| v2|
>>>>
>>>> +---+---+---+
>>>>
>>>> |  0|  1|2.0|
>>>>
>>>> |  0|  2|2.0|
>>>>
>>>> |  0|  3|2.0|
>>>>
>>>> +---+---+---+
>>>>
>>>> (2)
>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>
>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>
>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>
>>>> +---+---+---+
>>>>
>>>> | id|  v| v2|
>>>>
>>>> +---+---+---+
>>>>
>>>> |  0|  1|1.0|
>>>>
>>>> |  0|  2|1.5|
>>>>
>>>> |  0|  3|2.0|
>>>>
>>>> +---+---+---+
>>>>
>>>> Seems like orderBy('v') in the example (2) also changes the frame
>>>> boundaries from (
>>>>
>>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>>>> currentRow).
>>>>
>>>>
>>>> I found this behavior a bit unintuitive. I wonder if this behavior is
>>>> by design and if so, what's the specific rule that orderBy() interacts with
>>>> frame boundaries?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Li
>>>>
>>>>
>>>
>>
>

Mime
View raw message