spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame
Date Mon, 10 Oct 2016 04:30:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-17845:
--------------------------------
    Description: 
ANSI SQL uses the following to specify the frame boundaries for window functions:

{code}
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
{code}

In Spark's DataFrame API, we use integer values to indicate relative position:
- 0 means "CURRENT ROW"
- -1 means "1 PRECEDING"
- Long.MinValue means "UNBOUNDED PRECEDING"
- Long.MaxValue means "UNBOUNDED FOLLOWING"

{code}
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetween(Long.MinValue, 0)

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, 3)

// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetween(Long.MinValue, Long.MaxValue)
{code}

I think using numeric values to indicate relative positions is actually a good idea, but the
reliance on Long.MinValue and Long.MaxValue to indicate unbounded ends is pretty confusing:

1. The API is not self-evident. There is no way for a new user to figure out how to indicate
an unbounded frame by looking at just the API; the user has to read the docs to figure it
out.
2. It is weird that Long.MinValue and Long.MaxValue carry special meaning.
3. Different languages have different min/max values; e.g. in Python we use -sys.maxsize and
+sys.maxsize.
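
To illustrate the readability point, here is a minimal Python sketch (not Spark code; the
constant names and the frame-resolution helper are hypothetical) showing how named sentinel
constants read compared to raw min/max magic values:

{code}
# Hypothetical named boundary constants, analogous to what a clearer API
# could expose instead of Long.MinValue / Long.MaxValue.
UNBOUNDED_PRECEDING = float("-inf")
UNBOUNDED_FOLLOWING = float("inf")
CURRENT_ROW = 0

def frame_indices(n_rows, row, start, end):
    """Resolve a ROWS frame (start, end) relative to `row` into
    inclusive index bounds within a partition of n_rows rows."""
    lo = 0 if start == UNBOUNDED_PRECEDING else max(0, row + int(start))
    hi = n_rows - 1 if end == UNBOUNDED_FOLLOWING else min(n_rows - 1, row + int(end))
    return lo, hi

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, for row 3 of 10 rows
print(frame_indices(10, 3, UNBOUNDED_PRECEDING, CURRENT_ROW))  # (0, 3)

# ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
print(frame_indices(10, 3, -3, 3))  # (0, 6)
{code}

The call sites stay numeric for bounded offsets, and the unbounded cases name themselves,
so no doc lookup or language-specific min/max value is needed.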



  was:
In SQL, we use the following to specify the frame boundaries for window functions:




> Improve window function frame boundary API in DataFrame
> -------------------------------------------------------
>
>                 Key: SPARK-17845
>                 URL: https://issues.apache.org/jira/browse/SPARK-17845
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
