spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jakob Odersky <ja...@odersky.com>
Subject Re: Confusing RDD function
Date Wed, 09 Mar 2016 02:37:28 GMT
Hi Jeff,

> But in our development environment, the returned RDD results were empty and b.function(_)
was never executed
what do you mean by "the returned RDD results were empty", did you try
running a foreach, collect or any other action on the returned RDD[C]?

Spark provides two kinds of operations on RDDs:
1. transformations, which return a new RDD and are lazy and
2. actions that actually run an RDD and return some kind of result.
In your example above, 'map' is a transformation and thus is not
actually applied until some action (like 'foreach') is called on the
resulting RDD.
You can find more information in the Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

best,
--Jakob

On Tue, Mar 8, 2016 at 5:41 PM, Hemminger Jeff <jeff@atware.co.jp> wrote:
>
> I'm currently developing a Spark Streaming application.
>
> I have a function that receives an RDD and an object instance as  a
> parameter, and returns an RDD:
>
> def doTheThing(a: RDD[A], b: B): RDD[C]
>
>
> Within the function, I do some processing within a map of the RDD.
> Like this:
>
>
> def doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   a.combineByKey(...).map(b.function(_))
>
> }
>
>
> I combine the RDD by key, then map the results calling a function of
> instance b, and return the results.
>
> Here is where I ran into trouble.
>
> In a unit test running Spark in memory, I was able to convince myself that
> this worked well.
>
> But in our development environment, the returned RDD results were empty and
> b.function(_) was never executed.
>
> However, when I added an otherwise useless foreach:
>
>
> doTheThing(a: RDD[A], b: B): RDD[C] {
>
>   val results = a.combineByKey(...).map(b.function(_))
>
>   results.foreach( p => p )
>
>   results
>
> }
>
>
> Then it works.
>
> So, basically, adding an extra foreach iteration appears to cause
> b.function(_) to execute and returns results correctly.
>
> I find this confusing. Can anyone shed some light on why this would be?
>
> Thank you,
> Jeff
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message