spark-issues mailing list archives

From "Reynold Xin (JIRA)" <>
Subject [jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum
Date Sat, 08 Oct 2016 21:10:21 GMT


Reynold Xin commented on SPARK-10496:

I think there are two separate issues here:

1. The API for running a cumulative sum right now is fairly awkward: you either do it through a complicated
join, or through window functions that still look fairly verbose. I've created a notebook
that contains two short examples doing this in SQL and in DataFrames:

It would make sense to me to create a simpler API for this case, since it is very common.
This API under the hood can just call the existing window function API.
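For reference, the window-function form mentioned above looks roughly like this in SQL (the table name `t` and columns `id`, `x` are illustrative, not from the issue):

```sql
SELECT x,
       SUM(x) OVER (ORDER BY id
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS y
FROM t
```

A simpler dedicated API could generate exactly this window specification under the hood.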

2. The implementation is slow when there is a single window partition, because
it requires shuffling all the data to one node. This could technically be run as just a prefix
scan. For this case, I'd add an optimizer rule or a physical plan change to improve it.
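To illustrate the prefix-scan idea outside Spark: within a single ordered partition, a cumulative sum is just a sequential one-pass scan, which Python's standard library expresses directly (a minimal sketch, not Spark code):

```python
from itertools import accumulate

# Cumulative (prefix) sum over one ordered partition:
# each output element is the sum of all inputs up to and including that index.
x = [3, 1, 4, 1, 5]
y = list(accumulate(x))
print(y)  # [3, 4, 8, 9, 14]
```

A distributed version would scan each partition locally, then add each partition's offset (the sum of all preceding partitions) to its elements, avoiding a full shuffle.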

> Efficient DataFrame cumulative sum
> ----------------------------------
>                 Key: SPARK-10496
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Joseph K. Bradley
>            Priority: Minor
> Goal: Given a DataFrame with a numeric column X, create a new column Y which is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large number of rows. It could be done more efficiently using a prefix sum/scan.

This message was sent by Atlassian JIRA

