spark-user mailing list archives

From: Sean Owen <so...@cloudera.com>
Subject: Re: Computing mean and standard deviation by key
Date: Fri, 01 Aug 2014 19:39:50 GMT
You're certainly not iterating on the driver. The Iterable you process
in your function stays on the cluster, and the processing is done in parallel.
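
For concreteness, a minimal sketch of that pattern (MyClass and the sample
data here are assumptions standing in for your actual code):

    import org.apache.spark.{SparkConf, SparkContext}
    // (On Spark < 1.3, also: import org.apache.spark.SparkContext._
    //  to get the pair-RDD functions such as groupByKey.)

    // Hypothetical stand-in for the MyClass in the question.
    case class MyClass(foo: Double)

    object MeanStddevByKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("mean-stddev-by-key").setMaster("local[*]"))

        // Toy data in place of your real RDD[(String, MyClass)].
        val rdd = sc.parallelize(Seq(
          ("a", MyClass(1.0)), ("a", MyClass(3.0)), ("b", MyClass(5.0))))

        // The function passed to mapValues runs on the executors: each key's
        // Iterable is consumed where it lives, not shipped to the driver.
        val stats = rdd.groupByKey().mapValues { vals =>
          val xs = vals.map(_.foo)
          val n = xs.size
          val mean = xs.sum / n
          val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
          (mean, stdev)
        }

        stats.collect().foreach(println) // one small (mean, stdev) pair per key
        sc.stop()
      }
    }

Nothing comes back to the driver except the final collect() of one small
(mean, stdev) pair per key. A single-pass alternative that skips the grouping
step entirely is sketched after the quoted thread below.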

On Fri, Aug 1, 2014 at 8:36 PM, Kristopher Kalish <kris@kalish.net> wrote:
> The reason I want an RDD is that I'm assuming iterating over the
> individual elements of an RDD on the driver of the cluster is much slower
> than computing the mean and standard deviation with a map-reduce-based
> algorithm.
>
> I don't know the intimate details of Spark's implementation, but it seems
> like each element of the iterable would need to be serialized and sent to
> the driver, which would maintain the state (count, sum, total deviation
> from the mean, etc.). That is a lot of network traffic.
>
> -Kris
>
>
> On Fri, Aug 1, 2014 at 2:57 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> On Fri, Aug 1, 2014 at 7:55 PM, kriskalish <kris@kalish.net> wrote:
>> > I have what seems like a relatively straightforward task to accomplish,
>> > but I cannot seem to figure it out from the Spark documentation or by
>> > searching the mailing list.
>> >
>> > I have an RDD[(String, MyClass)] that I would like to group by key and
>> > then calculate the mean and standard deviation of the "foo" field of
>> > MyClass. It "feels" like I should be able to use groupBy to get an RDD
>> > for each unique key, but it gives me an Iterable.
>>
>> Hm, why would you expect or want that? An RDD is a large distributed
>> data set. It's much easier to compute a mean and stdev over an
>> Iterable of numbers than over an RDD.
>>
>> You can map your class to its double field and use anything that
>> operates on doubles.
>
>
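
For reference, here is a single-pass version of the map-reduce-style
aggregation Kris describes above. It never materializes a per-key Iterable:
each partition folds its values into a (count, sum, sum of squares) triple,
and only those small triples cross the network during the shuffle. A sketch
against the same hypothetical rdd as in the first example (aggregateByKey
needs Spark 1.1+; combineByKey does the same job on older versions):

    // Reduce each MyClass to its double field, then fold per key.
    val moments = rdd.mapValues(_.foo).aggregateByKey((0L, 0.0, 0.0))(
      (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
      (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

    val stats = moments.mapValues { case (n, s, ss) =>
      val mean = s / n
      // Population stddev via the naive formula; clamp tiny negative
      // rounding error before the sqrt.
      (mean, math.sqrt(math.max(0.0, ss / n - mean * mean)))
    }

Spark also ships org.apache.spark.util.StatCounter, which tracks exactly
these moments for you: rdd.mapValues(_.foo).aggregateByKey(new
StatCounter())(_ merge _, _ merge _) yields a per-key StatCounter with
mean and stdev methods.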
