spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Evan R. Sparks" <evan.spa...@gmail.com>
Subject Re: Computing mean and standard deviation by key
Date Fri, 01 Aug 2014 21:29:47 GMT
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.

Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and spark will run do everything
cluster-parallel, and then shuffle the (small) result set and merge
appropriately.

For example:
input
  .map{ case (k, v) => (k, (1, v, v*v)) }
  .reduceByKey { case ((c1, s1, ss1), (c2, s2, ss2)) => (c1+c2, s1+s2,
ss1+ss2) }
  .map { case (k, (count, sum, sumsq)) => (k, sumsq/count - (sum/count *
sum/count)) }

This is by no means the most memory/time efficient way to do it, but I
think it's a nice example of how to think about using spark at a higher
level of abstraction.

- Evan



On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen <sowen@cloudera.com> wrote:

> Here's the more functional programming-friendly take on the
> computation (but yeah this is the naive formula):
>
> rdd.groupByKey.mapValues { mcs =>
>   val values = mcs.map(_.foo.toDouble)
>   val n = values.count
>   val sum = values.sum
>   val sumSquares = values.map(x => x * x).sum
>   math.sqrt(n * sumSquares - sum * sum) / n
> }
>
> This gives you a bunch of (key,stdev). I think you want to compute
> this RDD and *then* do something to save it if you like. Sure, that
> could be collecting it locally and saving to a DB. Or you could use
> foreach to do something remotely for every key-value pair. More
> efficient would be to mapPartitions and do something to a whole
> partition of key-value pairs at a time.
>
>
> On Fri, Aug 1, 2014 at 9:56 PM, kriskalish <kris@kalish.net> wrote:
> > So if I do something like this, spark handles the parallelization and
> > recombination of sum and count on the cluster automatically? I started
> > peeking into the source and see that foreach does submit a job to the
> > cluster, but it looked like the inner function needed to return
> something to
> > work properly.
> >
> > val grouped = rdd.groupByKey()
> > grouped.foreach{ x =>
> >     val iterable = x._2
> >     var sum = 0.0
> >     var count = 0
> >     iterable.foreach{ y =>
> >         sum = sum + y.foo
> >         count = count + 1
> >     }
> >     val mean = sum/count;
> >     // save mean to database...
> > }
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192p11207.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Mime
View raw message