spark-user mailing list archives

From Anand Nalya <>
Subject Breaking lineage and reducing stages in Spark Streaming
Date Thu, 09 Jul 2015 09:48:09 GMT

I have an application in which an RDD is updated with tuples coming
from the RDDs in a DStream, following this pattern:

dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_ + _)
}

I'm using cache() and checkpointing to cache the results. Over time, the
lineage of myRDD keeps growing, and the number of stages in each batch of the
DStream keeps increasing, even though all the earlier stages are skipped. Once
the number of stages grows large enough, the overall scheduling delay starts
increasing, even though the processing time for each batch stays fixed.
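For reference, the full pattern looks roughly like this. This is a sketch only; the driver setup, the `ssc` StreamingContext, the key/value types, and the batch counter used to trigger periodic checkpoints are assumptions, not part of my actual code. Calling checkpoint() materializes the RDD to reliable storage and truncates its lineage, which is what should stop the per-batch stage count from growing.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext

// Assumed context: `ssc` is a StreamingContext with a checkpoint
// directory set, `dstream` is the input DStream, and `myfilter`
// is the filter predicate from the snippet above.
var myRDD: RDD[(String, Long)] = ssc.sparkContext.emptyRDD
var batchCount = 0L

dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_ + _)
  myRDD.cache()
  batchCount += 1

  // Every N batches (10 here, arbitrarily), checkpoint the
  // accumulated RDD. Checkpointing writes it out and cuts the
  // lineage, so the DAG no longer grows without bound.
  if (batchCount % 10 == 0) {
    myRDD.checkpoint()
    myRDD.count() // force an action so the checkpoint is actually taken
  }
}
```

Without the periodic checkpoint-and-materialize step, each batch appends a union and a reduceByKey to the lineage, which matches the ever-growing stage counts I'm seeing.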

The following figure illustrates the problem:

[Figure: job execution timeline — image not reproduced in the archive]
Is there some pattern that I can use to avoid this?

