spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hemminger Jeff <j...@atware.co.jp>
Subject Re: Confusing RDD function
Date Wed, 09 Mar 2016 03:37:00 GMT
Thank you, yes that makes sense.
I was aware of transformations and actions, but did not realize foreach was
an action. I've found the exhaustive list here
http://spark.apache.org/docs/latest/programming-guide.html#actions
and it's clear to me again.

Thank you for your help!

On Wed, Mar 9, 2016 at 11:37 AM, Jakob Odersky <jakob@odersky.com> wrote:

> Hi Jeff,
>
> > But in our development environment, the returned RDD results were empty
> and b.function(_) was never executed
> what do you mean by "the returned RDD results were empty", did you try
> running a foreach, collect or any other action on the returned RDD[C]?
>
> Spark provides two kinds of operations on RDDs:
> 1. transformations, which return a new RDD and are lazy and
> 2. actions that actually run an RDD and return some kind of result.
> In your example above, 'map' is a transformation and thus is not
> actually applied until some action (like 'foreach') is called on the
> resulting RDD.
> You can find more information in the Spark Programming Guide
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
>
> best,
> --Jakob
>
> On Tue, Mar 8, 2016 at 5:41 PM, Hemminger Jeff <jeff@atware.co.jp> wrote:
> >
> > I'm currently developing a Spark Streaming application.
> >
> > I have a function that receives an RDD and an object instance as  a
> > parameter, and returns an RDD:
> >
> > def doTheThing(a: RDD[A], b: B): RDD[C]
> >
> >
> > Within the function, I do some processing within a map of the RDD.
> > Like this:
> >
> >
> > def doTheThing(a: RDD[A], b: B): RDD[C] {
> >
> >   a.combineByKey(...).map(b.function(_))
> >
> > }
> >
> >
> > I combine the RDD by key, then map the results calling a function of
> > instance b, and return the results.
> >
> > Here is where I ran into trouble.
> >
> > In a unit test running Spark in memory, I was able to convince myself
> that
> > this worked well.
> >
> > But in our development environment, the returned RDD results were empty
> and
> > b.function(_) was never executed.
> >
> > However, when I added an otherwise useless foreach:
> >
> >
> > doTheThing(a: RDD[A], b: B): RDD[C] {
> >
> >   val results = a.combineByKey(...).map(b.function(_))
> >
> >   results.foreach( p => p )
> >
> >   results
> >
> > }
> >
> >
> > Then it works.
> >
> > So, basically, adding an extra foreach iteration appears to cause
> > b.function(_) to execute and returns results correctly.
> >
> > I find this confusing. Can anyone shed some light on why this would be?
> >
> > Thank you,
> > Jeff
> >
> >
> >
>

Mime
View raw message