spark-user mailing list archives

From Jon Barksdale <jon.barksd...@gmail.com>
Subject Re: Cumulative Sum function using Dataset API
Date Tue, 09 Aug 2016 17:01:28 GMT
Hi Santoshakhilesh,

I'd seen that already, but I was trying to avoid using RDDs to perform this
calculation.

@Ayan, it seems I was mistaken, and doing a sum(b) over (order by b) totally
works.  I guess I expected windowing with sum to work more like it does in
Oracle.  Thanks for the suggestion :)
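For the record, here's a minimal sketch of that approach with the Dataset API
(untested here; it assumes an existing SparkSession named spark):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Hypothetical data: a single numeric column b = 1, 2, 3, 4.
val df = spark.range(1, 5).toDF("b")

// An ordered window with no partitioning. The default frame for an
// ordered window runs from unbounded preceding to the current row,
// which is exactly a running total.
val w = Window.orderBy("b")

df.withColumn("cumsum", sum("b").over(w)).show()
// For b = 1, 2, 3, 4 the cumsum column should be 1, 3, 6, 10.
```

One caveat: with no partitionBy, Spark has to pull every row into a single
partition to order them, so this form won't scale to large data without a
partition key.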

Thank you both for your help,

Jon

On Tue, Aug 9, 2016 at 3:01 AM Santoshakhilesh <santosh.akhilesh@huawei.com>
wrote:

> You could check following link.
>
>
> http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark
>
>
>
> *From:* Jon Barksdale [mailto:jon.barksdale@gmail.com]
> *Sent:* 09 August 2016 08:21
> *To:* ayan guha
> *Cc:* user
> *Subject:* Re: Cumulative Sum function using Dataset API
>
>
>
> I don't think that would work properly, and would probably just give me
> the sum for each partition. I'll give it a try when I get home just to be
> certain.
>
> To maybe explain the intent better: if I have a pre-sorted column of
> (1, 2, 3, 4), then the cumulative sum would return (1, 3, 6, 10).
>
> Does that make sense? Naturally, if ordering a sum turns it into a
> cumulative sum, I'll gladly use that :)
>
> Jon
>
> On Mon, Aug 8, 2016 at 4:55 PM ayan guha <guha.ayan@gmail.com> wrote:
>
> You mean you are not able to use sum(col) over (partition by key order by
> some_col) ?
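>
> A sketch of that same expression with the DataFrame API (column names are
> hypothetical, untested here):
>
> ```scala
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.sum
>
> val w = Window.partitionBy("key").orderBy("some_col")
> df.withColumn("running_sum", sum("col").over(w))
> ```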
>
>
>
> On Tue, Aug 9, 2016 at 9:53 AM, jon <jon.barksdale@gmail.com> wrote:
>
> Hi all,
>
> I'm trying to write a function that calculates a cumulative sum as a column
> using the Dataset API, and I'm a little stuck on the implementation.  From
> what I can tell, UserDefinedAggregateFunctions don't seem to support
> windowing clauses, which I think I need for this use case.  If I write a
> function that extends AggregateWindowFunction, I end up needing classes
> that are package-private to the sql package, so I'd have to put my function
> under the org.apache.spark.sql package, which just feels wrong.
>
> I've also considered writing a custom transformer, but I haven't spent as
> much time reading through that code, so I don't know how easy or hard it
> would be.
>
> TL;DR: What's the best way to write a function that returns a value for
> every row, but has mutable state, and receives rows in a specific order?
>
> Does anyone have any ideas, or examples?
>
> Thanks,
>
> Jon
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cumulative-Sum-function-using-Dataset-API-tp27496.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
