spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hemminger Jeff <>
Subject Confusing RDD function
Date Wed, 09 Mar 2016 01:41:07 GMT
I'm currently developing a Spark Streaming application.

I have a function that receives an RDD and an object instance as  a
parameter, and returns an RDD:

def doTheThing(a: RDD[A], b: B): RDD[C]

Within the function, I do some processing within a map of the RDD.
Like this:

def doTheThing(a: RDD[A], b: B): RDD[C] {



I combine the RDD by key, then map the results calling a function of
instance b, and return the results.

Here is where I ran into trouble.

In a unit test running Spark in memory, I was able to convince myself that
this worked well.

But in our development environment, the returned RDD results were empty and
b.function(_) was never executed.

However, when I added an otherwise useless foreach:

doTheThing(a: RDD[A], b: B): RDD[C] {

  val results = a.combineByKey(...).map(b.function(_))

  results.foreach( p => p )



Then it works.

So, basically, adding an extra foreach iteration appears to cause
b.function(_) to execute and returns results correctly.

I find this confusing. Can anyone shed some light on why this would be?

Thank you,

View raw message