spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Clarify window behavior in Spark SQL
Date Wed, 04 Apr 2018 00:54:55 GMT
Ah ok. Thanks for commenting. Everyday I learn something new about SQL.

For others to follow, SQL Server has a good explanation of the behavior:
https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql


Can somebody (Li?) update the API documentation to specify the gotchas, in
case users are not familiar with SQL window function semantics?



General Remarks
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>

More than one window function can be used in a single query with a single
FROM clause. The OVER clause for each function can differ in partitioning
and ordering.

If PARTITION BY is not specified, the function treats all rows of the query
result set as a single group.
Important!
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>

If ROWS/RANGE is specified and <window frame preceding> is used for <window
frame extent> (short syntax) then this specification is used for the window
frame boundary starting point and CURRENT ROW is used for the boundary
ending point. For example “ROWS 5 PRECEDING” is equal to “ROWS BETWEEN 5
PRECEDING AND CURRENT ROW”.

Note

If ORDER BY is not specified, the entire partition is used for the window
frame. This applies only to functions that do not require an ORDER BY
clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window
frame. This applies only to functions that can accept an optional
ROWS/RANGE specification. For example, ranking functions cannot accept
ROWS/RANGE, therefore this window frame is not applied even though ORDER
BY is present and ROWS/RANGE is not.





On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1987@gmail.com> wrote:

> This is actually by design: without an `ORDER BY` clause, all rows are
> considered peers of the current row, which means the frame is
> effectively the entire partition. This behavior follows the window
> syntax of PostgreSQL.
> You can refer to the comment by yhuai:
> https://github.com/apache/spark/pull/5604#discussion_r157931911
> :)
>
> 2018-04-04 6:27 GMT+08:00 Reynold Xin <rxin@databricks.com>:
>
>> Do other (non-Hive) SQL systems do the same thing?
>>
>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
>> herman@databricks.com> wrote:
>>
>>> This is something we inherited from Hive: https://cwiki.apache.org
>>> /confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>
>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>> CURRENT ROW.
>>>
>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>> specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND
>>>> UNBOUNDED FOLLOWING.
>>>
>>>
>>> It sort of makes sense if you think about it. If there is no ordering,
>>> there is no way to have a bounded frame. If there is an ordering, we
>>> default to the most commonly used deterministic frame.
>>>
>>>
>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> Seems like a bug.
>>>>
>>>>
>>>>
>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:
>>>>
>>>>> Hi Devs,
>>>>>
>>>>> I am seeing some behavior with window functions that is a bit
>>>>> unintuitive and would like to get some clarification.
>>>>>
>>>>> When using aggregation function with window, the frame boundary seems
>>>>> to change depending on the order of the window.
>>>>>
>>>>> Example:
>>>>> (1)
>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>> w1 = Window.partitionBy('id')
>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>
>>>>> +---+---+---+
>>>>> | id|  v| v2|
>>>>> +---+---+---+
>>>>> |  0|  1|2.0|
>>>>> |  0|  2|2.0|
>>>>> |  0|  3|2.0|
>>>>> +---+---+---+
>>>>>
>>>>> (2)
>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>
>>>>> +---+---+---+
>>>>> | id|  v| v2|
>>>>> +---+---+---+
>>>>> |  0|  1|1.0|
>>>>> |  0|  2|1.5|
>>>>> |  0|  3|2.0|
>>>>> +---+---+---+
>>>>>
>>>>> Seems like orderBy('v') in example (2) also changes the frame
>>>>> boundaries from (unboundedPreceding, unboundedFollowing) to
>>>>> (unboundedPreceding, currentRow).
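For comparison, the two outputs above can be reproduced with a
plain-Python sketch of the frame logic (`mean_over_frame` is an
illustrative helper, not a Spark API):

```python
def mean_over_frame(values, unbounded_following):
    """Mean of v over each row's frame within one partition,
    rows ordered by v (distinct values assumed for simplicity)."""
    values = sorted(values)
    out = []
    for i, v in enumerate(values):
        # frame end: whole partition, or up to the current row
        end = len(values) if unbounded_following else i + 1
        frame = values[:end]
        out.append(sum(frame) / len(frame))
    return out

# Default frame once ordering is present: ... AND CURRENT ROW
print(mean_over_frame([1, 2, 3], unbounded_following=False))  # [1.0, 1.5, 2.0]
# Explicit UNBOUNDED FOLLOWING restores the whole-partition mean
print(mean_over_frame([1, 2, 3], unbounded_following=True))   # [2.0, 2.0, 2.0]
```

If both ordering and the full-partition aggregate are wanted, pinning the
frame explicitly, e.g. with something like
`w2.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`,
should keep the result order-independent.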
>>>>>
>>>>>
>>>>> I found this behavior a bit unintuitive. I wonder if this behavior
>>>>> is by design and, if so, what is the specific rule for how orderBy()
>>>>> interacts with frame boundaries?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Li
>>>>>
>>>>>
>>>>
>>>
>>
>
