spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michel Hubert <>
Subject RE: Breaking lineage and reducing stages in Spark Streaming
Date Thu, 09 Jul 2015 10:05:42 GMT

I was just wondering how you generated to second image with the charts.
What product?

From: Anand Nalya []
Sent: donderdag 9 juli 2015 11:48
To: spark users
Subject: Breaking lineage and reducing stages in Spark Streaming


I've an application in which an rdd is being updated with tuples coming from RDDs in a DStream
with following pattern.

dstream.foreachRDD(rdd => {
  myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_+_)

I'm using cache() and checkpointin to cache results. Over the time, the lineage of myRDD keeps
increasing and stages in each batch of dstream keeps increasing, even though all the earlier
stages are skipped. When the number of stages grow big enough, the overall delay due to scheduling
delay starts increasing. The processing time for each batch is still fixed.

Following figures illustrate the problem:

Job execution:
[Image removed by sender.]
[Image removed by sender.]
Is there some pattern that I can use to avoid this?

View raw message