spark-user mailing list archives

From Richard Siebeling <rsiebel...@gmail.com>
Subject Re: combining operations elegantly
Date Mon, 24 Mar 2014 23:48:52 GMT
Hi guys,

thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard

@Patrick, thanks for the release calendar


On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell <pwendell@gmail.com> wrote:

> Hey All,
>
> I think the old thread is here:
> https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
>
> The method proposed in that thread is to create a utility class for
> doing single-pass aggregations. Using Algebird is a pretty good way to
> do this and is a bit more flexible since you don't need to create a
> new utility each time you want to do this.
>
> In Spark 1.0 and later you will be able to do this more elegantly with
> the schema support:
> myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
> Average('duration) as 'duration)
>
> and it will use a single pass automatically... but that's not quite
> released yet :)
>
> - Patrick
>
>
>
>
> On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <koert@tresata.com> wrote:
> > i currently typically do something like this:
> >
> > scala> val rdd = sc.parallelize(1 to 10)
> > scala> import com.twitter.algebird.Operators._
> > scala> import com.twitter.algebird.{Max, Min}
> > scala> rdd.map{ x => (
> >      |   1L,
> >      |   Min(x),
> >      |   Max(x),
> >      |   x
> >      | )}.reduce(_ + _)
> > res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)
> >
> > however for this you need twitter algebird dependency. without that you
> > have to code the reduce function on the tuples yourself...
> >
> > another example with 2 columns, where i do conditional count for first
> > column, and simple sum for second:
> > scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
> >      |   if (x > 5) 1 else 0,
> >      |   y
> >      | )}.reduce(_ + _)
> > res3: (Int, Int) = (5,155)
> >
> >
> >
> > On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebeling@gmail.com> wrote:
> >>
> >> Hi Koert, Patrick,
> >>
> >> do you already have an elegant solution to combine multiple operations
> >> on a single RDD?
> >> Say for example that I want to do a sum over one column, a count and an
> >> average over another column,
> >>
> >> thanks in advance,
> >> Richard
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rsiebeling@gmail.com> wrote:
> >>>
> >>> Patrick, Koert,
> >>>
> >>> I'm also very interested in these examples, could you please post them
> >>> if you find them?
> >>> thanks in advance,
> >>> Richard
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <koert@tresata.com> wrote:
> >>>>
> >>>> not that long ago there was a nice example on here about how to combine
> >>>> multiple operations on a single RDD. so basically if you want to do a
> >>>> count() and something else, how to roll them into a single job. i think
> >>>> patrick wendell gave the examples.
> >>>>
> >>>> i can't find them anymore... patrick can you please repost? thanks!
> >>>
> >>>
> >>
> >
>
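
For readers without the Algebird dependency, the "code the reduce function on the tuples yourself" route mentioned in the thread could look roughly like the sketch below. It is an illustration only, not code from any of the posters: the names SinglePass, Stats, single, combine and mean are hypothetical, and a local Scala collection stands in for the RDD so the snippet runs without Spark. Because combine is associative and commutative, the same function should be usable as rdd.map(single).reduce(combine).

```scala
object SinglePass {
  // (count, min, max, sum) for the values seen so far.
  // This tuple layout is an assumption made for the example.
  type Stats = (Long, Int, Int, Long)

  // Lift a single value into a Stats tuple.
  def single(x: Int): Stats = (1L, x, x, x.toLong)

  // Merge two partial results; associative and commutative,
  // so it is safe to use as a reduce function.
  def combine(a: Stats, b: Stats): Stats =
    (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3), a._4 + b._4)

  // The average falls out of the sum and the count, no extra pass needed.
  def mean(s: Stats): Double = s._4.toDouble / s._1

  def main(args: Array[String]): Unit = {
    // A local Range stands in for the RDD here (assumption);
    // with Spark this would be rdd.map(single).reduce(combine).
    val stats = (1 to 10).map(single).reduce(combine)
    println(stats)       // (10,1,10,55)
    println(mean(stats)) // 5.5
  }
}
```

The count, min, max and sum match the Algebird result quoted above for 1 to 10, and the average comes from the same single pass.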
