spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: Clarify window behavior in Spark SQL
Date Tue, 03 Apr 2018 22:27:57 GMT
Do other (non-Hive) SQL systems do the same thing?

On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
herman@databricks.com> wrote:

> This is something we inherited from Hive: https://cwiki.apache.
> org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>
> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>> ROW.
>
> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>> FOLLOWING.
>
>
> It sort of makes sense if you think about it. If there is no ordering
> there is no way to have a bound frame. If there is ordering we default to
> the most commonly used deterministic frame.
>
>
> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>> Seems like a bug.
>>
>>
>>
>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:
>>
>>> Hi Devs,
>>>
>>> I am seeing some behavior with window functions that is a bit
>>> unintuitive and would like to get some clarification.
>>>
>>> When using aggregation function with window, the frame boundary seems to
>>> change depending on the order of the window.
>>>
>>> Example:
>>> (1)
>>>
>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>
>>> w1 = Window.partitionBy('id')
>>>
>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>
>>> +---+---+---+
>>>
>>> | id|  v| v2|
>>>
>>> +---+---+---+
>>>
>>> |  0|  1|2.0|
>>>
>>> |  0|  2|2.0|
>>>
>>> |  0|  3|2.0|
>>>
>>> +---+---+---+
>>>
>>> (2)
>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>
>>> w2 = Window.partitionBy('id').orderBy('v')
>>>
>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>
>>> +---+---+---+
>>>
>>> | id|  v| v2|
>>>
>>> +---+---+---+
>>>
>>> |  0|  1|1.0|
>>>
>>> |  0|  2|1.5|
>>>
>>> |  0|  3|2.0|
>>>
>>> +---+---+---+
>>>
>>> Seems like orderBy('v') in the example (2) also changes the frame
>>> boundaries from (
>>>
>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>>> currentRow).
>>>
>>>
>>> I found this behavior a bit unintuitive. I wonder if this behavior is by
>>> design and if so, what's the specific rule that orderBy() interacts with
>>> frame boundaries?
>>>
>>>
>>> Thanks,
>>>
>>> Li
>>>
>>>
>>
>

Mime
View raw message