From Mahdi Namazifar
Subject A chain of lazy operations starts running tasks
Date Tue, 24 Sep 2013 00:47:36 GMT

I think I might be missing something, but what I observe is inconsistent
with my understanding of transformations vs. actions. In the Spark shell I
run the following:

val a = sc.textFile("[my file]", 1000)
val c = a.flatMap(line => line.split("\t")).
  map(word => (word, 1)).
  reduceByKey((a, b) => a + b, 100).
  sortByKey(false, 500)

This is for experimentation only: I'm running a word count on a file read
from HDFS and then sorting the result by word (in descending order).

My understanding from the documentation is that flatMap, map, reduceByKey,
and sortByKey are all transformations and are therefore lazy. But when I
run the second statement (the one defining c), I see 1000 ShuffleMapTasks,
followed by 100 ResultTasks and another 100 ResultTasks, running on the
cluster, which take about 400 seconds in total. Am I missing something?
Could someone kindly explain what exactly happens when I run the second
statement? I expected it to only create an RDD and not run any tasks.
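
A minimal sketch of how the chain could be split up to see which step
launches the tasks, assuming the same Spark shell session (where sc is
already defined) and the same input file. The intermediate names (words,
pairs, counts, sorted) are made up for this sketch, and I have not confirmed
which step is responsible:

```scala
// Assumes the Spark shell, where `sc` is predefined.
// The names words, pairs, counts and sorted are hypothetical.
val a      = sc.textFile("[my file]", 1000)
val words  = a.flatMap(line => line.split("\t"))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey((x, y) => x + y, 100)
// Entered one statement at a time: if tasks only appear after this
// definition, sortByKey is the step that is not fully lazy.
val sorted = counts.sortByKey(false, 500)
```

Entering the statements one by one and watching the shell output after each
should show whether the tasks start at reduceByKey or at sortByKey.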

BTW, I'm using Spark 0.7.2 on a 1+4 node cluster.


