spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: combining operations elegantly
Date Sun, 23 Mar 2014 23:16:05 GMT
Hey All,

I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J

The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since you don't need to create a
new utility each time you want to do this.
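
For illustration, a hand-rolled utility of that kind might look roughly like
the sketch below. It is not the code from the old thread: the Stats class,
its field names, and the (clicks, duration) record layout are all
hypothetical, and a plain Scala Seq stands in for the RDD so the snippet runs
without a cluster. On a real RDD[(Int, Double)] the same two functions could
be passed to rdd.aggregate(Stats.zero)(seqOp, combOp).

```scala
// Hypothetical helper (not from the thread): one pass collects a sum over
// one column plus a count and a sum over another, so an average can be derived.
final case class Stats(clickSum: Long, count: Long, durationSum: Double) {
  // Fold a single (clicks, duration) record into the running totals.
  def add(clicks: Int, duration: Double): Stats =
    Stats(clickSum + clicks, count + 1, durationSum + duration)

  // Combine two partial results, e.g. totals from two partitions.
  def merge(that: Stats): Stats =
    Stats(clickSum + that.clickSum, count + that.count, durationSum + that.durationSum)

  // Average duration; NaN when no records were seen.
  def avgDuration: Double = if (count == 0) Double.NaN else durationSum / count
}

object Stats {
  val zero: Stats = Stats(0L, 0L, 0.0)
}

object SinglePassDemo {
  // Assumed sample data: (clicks, duration) pairs standing in for an RDD.
  val records: Seq[(Int, Double)] = Seq((3, 10.0), (5, 20.0), (2, 30.0))

  // Simulate two partitions: fold each one separately, then merge the partials.
  val (left, right) = records.splitAt(2)
  private def foldPartition(part: Seq[(Int, Double)]): Stats =
    part.foldLeft(Stats.zero) { case (acc, (clicks, duration)) => acc.add(clicks, duration) }

  val stats: Stats = foldPartition(left).merge(foldPartition(right))

  def main(args: Array[String]): Unit =
    println(s"clicks=${stats.clickSum} count=${stats.count} avgDuration=${stats.avgDuration}")
}
```

The downside is the one mentioned above: each new combination of aggregates
needs its own class like this, which is what Algebird's generic tuple
semigroups avoid.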

In Spark 1.0 and later you will be able to do this more elegantly with
the schema support:
    myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
                                Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite
released yet :)

- Patrick




On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <koert@tresata.com> wrote:
> i currently typically do something like this:
>
> scala> val rdd = sc.parallelize(1 to 10)
> scala> import com.twitter.algebird.Operators._
> scala> import com.twitter.algebird.{Max, Min}
> scala> rdd.map{ x => (
>      |   1L,
>      |   Min(x),
>      |   Max(x),
>      |   x
>      | )}.reduce(_ + _)
> res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
> Int) = (10,Min(1),Max(10),55)
>
> however, for this you need the twitter algebird dependency. without that you
> have to code the reduce function on the tuples yourself...
>
> another example with 2 columns, where i do a conditional count for the first
> column, and a simple sum for the second:
> scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
>      |   if (x > 5) 1 else 0,
>      |   y
>      | )}.reduce(_ + _)
> res3: (Int, Int) = (5,155)
>
>
>
> On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebeling@gmail.com>
> wrote:
>>
>> Hi Koert, Patrick,
>>
>> do you already have an elegant solution to combine multiple operations on
>> a single RDD?
>> Say for example that I want to do a sum over one column, a count and an
>> average over another column,
>>
>> thanks in advance,
>> Richard
>>
>>
>> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rsiebeling@gmail.com>
>> wrote:
>>>
>>> Patrick, Koert,
>>>
>>> I'm also very interested in these examples, could you please post them if
>>> you find them?
>>> thanks in advance,
>>> Richard
>>>
>>>
>>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>
>>>> not that long ago there was a nice example on here about how to combine
>>>> multiple operations on a single RDD. so basically if you want to do a
>>>> count() and something else, how to roll them into a single job. i think
>>>> patrick wendell gave the examples.
>>>>
>>>> i can't find them anymore... patrick, can you please repost? thanks!
>>>
>>>
>>
>
