http://spark.apache.org/docs/latest/streaming-programming-guide.html

The function passed to foreachRDD is executed on the driver.
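More precisely: that function runs on the driver once per batch interval, while the RDD transformations and actions invoked inside it still execute on the executors. A rough sketch (record, isTarget and writeToCassandra are the helpers from the code below; chooseTarget is just an illustrative placeholder):

dstream.foreachRDD { rdd =>
  // This block runs on the driver, once per batch.
  val target = chooseTarget()                // driver-side decision, placeholder helper

  rdd.map(elem => record(elem))              // runs on the executors
     .filter(rec => isTarget(target, rec))   // runs on the executors
     .writeToCassandra(target, table)        // the actual writes happen on the executors
}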

mn

On Oct 20, 2014, at 3:07 AM, Gerard Maas <gerard.maas@gmail.com> wrote:

Pinging TD  -- I'm sure you know :-)

-kr, Gerard.

On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas <gerard.maas@gmail.com> wrote:
Hi,

We have been implementing several Spark Streaming jobs that basically process data and insert it into Cassandra, sorting it into different keyspaces.

We've been following the pattern:

dstream.foreachRDD { rdd =>
    val records = rdd.map(elem => record(elem))
    targets.foreach { target =>
      records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
    }
}

I've been wondering whether there would be a performance difference between transforming the DStream directly and transforming the RDDs within the DStream, with regard to how the transformations get scheduled.

Instead of the RDD-centric computation, I could transform the DStream up until the last step, where I need an RDD in order to store the data.
For example, the previous transformation could be written as:

val recordStream = dstream.map(elem => record(elem))
targets.foreach { target =>
  recordStream.filter(record => isTarget(target, record)).foreachRDD(_.writeToCassandra(target, table))
}

Would there be a difference in execution and/or performance? What would be the preferred way to do this?

Bonus question: Is there a better (more performant) way to sort the data into different "buckets" than filtering the whole data collection once per bucket?
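For concreteness, per batch we currently make one filter pass per target over the same records. Caching the mapped RDD would at least avoid recomputing the map for every pass, although the per-bucket scans remain; a rough sketch, reusing the same record/isTarget/writeToCassandra helpers as above:

dstream.foreachRDD { rdd =>
  val records = rdd.map(elem => record(elem)).cache()   // materialise once, reuse for every target
  targets.foreach { target =>
    records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
  }
  records.unpersist()                                    // release the cached batch when done
}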

thanks,  Gerard.