Cool, learn something new every day.  Thanks again.

On Tue, Aug 9, 2016 at 4:08 PM ayan guha <> wrote:

Thanks for reporting back. Glad it worked for you. Actually, sum with partitioning behaves the same way in Oracle too.

On 10 Aug 2016 03:01, "Jon Barksdale" <> wrote:
Hi Santoshakhilesh, 

I'd seen that already, but I was trying to avoid using RDDs to perform this calculation.

@Ayan, it seems I was mistaken, and doing a sum(b) over(order by b) totally works.  I guess I expected windowing with sum to work more like Oracle's.  Thanks for the suggestion :)
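For anyone following along, a minimal pure-Python sketch of what sum(b) over(order by b) computes (this is only an illustration of the semantics, not Spark code; the function name is made up). With an ORDER BY and no explicit frame, the default frame is RANGE UNBOUNDED PRECEDING TO CURRENT ROW, so tied values share the same running total:

```python
def cumulative_sum_order_by(values):
    """Simulate SQL's sum(b) OVER (ORDER BY b): each row gets the sum
    of all values <= its own (default RANGE frame, so ties share the
    same running total)."""
    ordered = sorted(values)
    return [sum(v for v in ordered if v <= x) for x in ordered]

print(cumulative_sum_order_by([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(cumulative_sum_order_by([1, 2, 2, 3]))  # [1, 5, 5, 8] -- ties collapse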

Thank you both for your help, 


On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <> wrote:

You could check the following link.


From: Jon Barksdale []
Sent: 09 August 2016 08:21
To: ayan guha
Cc: user
Subject: Re: Cumulative Sum function using Dataset API


I don't think that would work properly, and would probably just give me the sum for each partition. I'll give it a try when I get home just to be certain.

To maybe explain the intent better: if I have a column (pre-sorted) of (1,2,3,4), then the cumulative sum would return (1,3,6,10).
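In plain Python (just to pin down the intended result, nothing Spark-specific), that's exactly what itertools.accumulate produces:

```python
from itertools import accumulate

column = [1, 2, 3, 4]
print(list(accumulate(column)))  # [1, 3, 6, 10]
```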

Does that make sense? Naturally, if ordering a sum turns it into a cumulative sum, I'll gladly use that :)


On Mon, Aug 8, 2016 at 4:55 PM ayan guha <> wrote:

You mean you are not able to use sum(col) over (partition by key order by some_col) ?
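As a sketch of what that expression means (pure Python, not Spark; the helper name and tuple layout here are made up for illustration): partition the rows by key, sort each partition by the ordering column, and keep a running sum per partition.

```python
from collections import defaultdict

def windowed_cumsum(rows):
    """Simulate sum(col) OVER (PARTITION BY key ORDER BY some_col).
    rows: iterable of (key, some_col, col) tuples.
    Returns (key, some_col, col, running_sum) tuples, ordered by
    (key, some_col), with the sum restarting in each partition."""
    by_key = defaultdict(list)
    for key, order_col, value in rows:
        by_key[key].append((order_col, value))
    result = []
    for key in sorted(by_key):
        running = 0
        for order_col, value in sorted(by_key[key]):
            running += value
            result.append((key, order_col, value, running))
    return result

rows = [("a", 1, 10), ("a", 2, 20), ("b", 1, 5)]
print(windowed_cumsum(rows))
# [('a', 1, 10, 10), ('a', 2, 20, 30), ('b', 1, 5, 5)]
```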


On Tue, Aug 9, 2016 at 9:53 AM, jon <> wrote:

Hi all,

I'm trying to write a function that calculates a cumulative sum as a column
using the Dataset API, and I'm a little stuck on the implementation.  From
what I can tell, UserDefinedAggregateFunctions don't seem to support
windowing clauses, which I think I need for this use case.  If I write a
function that extends AggregateWindowFunction, I end up needing classes
that are package-private to the sql package, so I have to put my function
under the org.apache.spark.sql package, which just feels wrong.

I've also considered writing a custom transformer, but I haven't spent as much
time reading through that code, so I don't know how easy or hard it would be.

TL;DR: What's the best way to write a function that returns a value for every
row, but has mutable state, and gets rows in a specific order?

Does anyone have any ideas, or examples?



Sent from the Apache Spark User List mailing list archive.



Best Regards,
Ayan Guha