spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum
Date Sat, 08 Oct 2016 21:10:21 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558724#comment-15558724
] 

Reynold Xin commented on SPARK-10496:
-------------------------------------

I think there are two separate issues here:

1. The API to run cumulative sum right now is fairly awkward. Either do it through a complicated
join, or through window functions that still look fairly verbose. I've created a notebook
that contains two short examples to do this in SQL and in DataFrames: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html

It would make sense to me to create a simpler API for this case, since it is very common.
This API under the hood can just call the existing window function API.

2. The implementation, for cases when there is a single window partition, is slow, because
it requires shuffling all the data. This can technically be run just a prefix scan. In this
case, I'd have an optimizer rule or physical plan changes to improve this.



> Efficient DataFrame cumulative sum
> ----------------------------------
>
>                 Key: SPARK-10496
>                 URL: https://issues.apache.org/jira/browse/SPARK-10496
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which is the cumulative
sum of X.
> This can be done with window functions, but it is not efficient for a large number of
rows.  It could be done more efficiently using a prefix sum/scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message