spark-user mailing list archives

From Kamal Banga <ka...@sigmoidanalytics.com>
Subject Re: about aggregateByKey and standard deviation
Date Mon, 03 Nov 2014 09:53:17 GMT
I don't think a plain .aggregateByKey() over the values alone is enough,
because we also need the count per key (for the average). Maybe we can use
.countByKey(), which returns a map, together with .foldByKey(0)(_+_) (or
aggregateByKey()), which gives the sum of values per key. I'm not sure how
to proceed from there myself.
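
One possible way around the missing count, as a rough sketch rather than a
tested Spark job: carry the count inside the accumulator itself, as a
(count, sum, sum of squares) triple, so a single aggregateByKey pass yields
everything needed for the mean and standard deviation. The names StatsByKey,
Acc, zero, seqOp, combOp and meanAndStd are made up for illustration, the
values are assumed to be Doubles, and the local Seq below only stands in for
the RDD so the snippet runs without Spark.

```scala
object StatsByKey {
  // Hypothetical accumulator per key: (count, sum, sum of squares).
  type Acc = (Long, Double, Double)
  val zero: Acc = (0L, 0.0, 0.0)

  // seqOp: fold one value into a partition-local accumulator.
  def seqOp(acc: Acc, v: Double): Acc =
    (acc._1 + 1, acc._2 + v, acc._3 + v * v)

  // combOp: merge the accumulators of two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Mean and population standard deviation from a non-empty accumulator.
  def meanAndStd(acc: Acc): (Double, Double) = {
    val (n, sum, sumSq) = acc
    val mean = sum / n
    // Clamp at zero: rounding can push the difference slightly negative.
    val variance = math.max(sumSq / n - mean * mean, 0.0)
    (mean, math.sqrt(variance))
  }

  // Assumed Spark usage, for some rdd: RDD[(String, Double)]:
  //   rdd.aggregateByKey(zero)(seqOp, combOp).mapValues(meanAndStd)

  def main(args: Array[String]): Unit = {
    // Local stand-in for the RDD, only to exercise the functions.
    val data = Seq(("k1", 1.0), ("k1", 2.0), ("k1", 3.0), ("k2", 4.0), ("k2", 4.0))
    val stats = data.groupBy(_._1).map { case (k, kvs) =>
      k -> meanAndStd(kvs.map(_._2).foldLeft(zero)(seqOp))
    }
    stats.foreach(println)
  }
}
```

Unlike groupByKey, this only shuffles one small triple per key per partition.
The sumSq / n - mean * mean formula can lose precision when the values are
large relative to their spread, so a more careful merge may be needed there.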

Regards

On Fri, Oct 31, 2014 at 1:26 PM, qinwei <wei.qin@dewmobile.net> wrote:

> Hi, everyone
>     I have an RDD filled with data like
>         (k1, v11)
>         (k1, v12)
>         (k1, v13)
>         (k2, v21)
>         (k2, v22)
>         (k2, v23)
>         ...
>
>     I want to calculate the average and standard deviation of (v11, v12,
> v13) and (v21, v22, v23), grouped by their keys.
>     For the moment, I have done that by using groupByKey and map. I notice
> that groupByKey is very expensive, but I cannot figure out how to do it
> by using aggregateByKey, so I wonder: is there any better way to do this?
>
> Thanks!
>
> ------------------------------
> qinwei
>
