Okay, the reason for the task delay within the executor: when some RDD partitions are in memory and some are still in Hadoop, i.e. there are multiple locality levels (NODE_LOCAL and ANY), the scheduler waits for spark.locality.wait, which defaults to 3 seconds. During this period the scheduler waits to launch a data-local task before giving up and launching it on a less-local node.

So after setting it to 0, all the tasks started in parallel. But I learned that it is better not to reduce it all the way to 0.
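For anyone hitting the same thing, a minimal sketch of lowering the wait instead of disabling it entirely; the 500ms value is only illustrative (not something validated in these runs), the app name is a placeholder, and the per-level key is optional:

    // Sketch: reduce spark.locality.wait instead of setting it to 0.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .set("spark.locality.wait", "500ms")       // overall fallback wait, default is 3s
      .set("spark.locality.wait.node", "500ms")  // optional per-level override for NODE_LOCAL
    val sc = new SparkContext(conf)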


On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.gates@gmail.com> wrote:
Hi All,


A sample Spark application reads a logfile from Hadoop (1.2GB, split into 5 RDD partitions of roughly 250MB each) and runs two jobs: Job A counts the lines containing "a" and Job B counts the lines containing "b". The application is run multiple times, each time with a different executor memory and with the cache() call enabled or disabled. Job A's performance is the same in all the runs, since it has to read the entire data from disk the first time.

Spark Cluster - standalone mode with Spark Master, single worker node (12 cores, 16GB memory)
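Assuming the executor sizing for each run is set programmatically, a hypothetical setup would look roughly like the sketch below; the master URL and app name are placeholders, and only the executor memory value changes between the runs:

    // Hypothetical per-run setup; only spark.executor.memory changes (2g / 6g / 16g).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")   // placeholder standalone master URL
      .setAppName("SimpleApp")
      .set("spark.executor.memory", "2g")      // 2g, 6g or 16g depending on the run
      .set("spark.cores.max", "12")            // all 12 cores of the single worker
    val sc = new SparkContext(conf)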

    // minPartitions hint is 2, but the HDFS splits of the 1.2GB file give 5 partitions (5 tasks per job)
    val logData = sc.textFile(logFile, 2)
    val numAs = logData.filter(line => line.contains("a")).count()   // Job A
    val numBs = logData.filter(line => line.contains("b")).count()   // Job B
   

Job B (which has 5 tasks) results below:

   
Run 1: 1 executor with 2GB memory and 12 cores; took 2 seconds [ran1 image]

    Since logData is not cached, Job B has to read the 1.2GB data from Hadoop into memory again. All 5 tasks started in parallel, each took 2 seconds (29ms of it GC), and the overall job completed in 2 seconds.
   
Run 2: 1 executor with 2GB memory and 12 cores, logData cached; took 4 seconds [ran2 image, ran2_cache image]

     val logData = sc.textFile(logFile, 2).cache()   // cache() uses the default MEMORY_ONLY storage level

     The executor does not have enough memory to cache the data, so it again has to read the entire 1.2GB from Hadoop into memory. But because cache() is used, trying to store the partitions causes long GC pauses that slow the tasks down. Each task started in parallel and completed in 4 seconds (more than 1 second of it GC).

Run 3: 1 executor with 6GB memory and 12 cores, logData cached; took 10 seconds [ran3 image]

     The executor has enough memory to fit 4 of the RDD partitions in memory, but the 5th partition has to be read from Hadoop. 4 tasks started in parallel and completed in 0.3 seconds without GC. But the 5th task, which has to read its partition from disk, started only after 4 seconds and then completed in 2 seconds. Analysing why the 5th task did not start in parallel with the other tasks, or at least immediately after the other tasks completed.
     
Run 4: 1 executor with 16GB memory and 12 cores, logData cached; took 0.3 seconds [ran4 image]

     The executor has enough memory to cache all 5 RDD partitions. All 5 tasks started in parallel and completed within 0.3 seconds.
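As an aside, here is a small sketch (not used in the runs above) of how the cached fraction could be checked programmatically after the first count(), instead of from the web UI; sc.getRDDStorageInfo is a developer API, so treat it as a quick inspection helper only:

    // Sketch: report how many of logData's partitions actually got cached, and where.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }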
     
     
So Spark performs well when the input data is either entirely in memory or not cached at all. When some partitions are in memory and some have to come from disk, there is a delay in scheduling the fifth task. Is this expected behaviour or a possible bug?
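As a side note (and not an answer to the scheduling question itself), a sketch of persisting with MEMORY_AND_DISK instead of cache(): the partition that does not fit in memory would spill to the executor's local disk rather than being re-read from HDFS, which should also avoid the NODE_LOCAL/ANY locality mix:

    // Sketch: MEMORY_AND_DISK spills partitions that do not fit in memory to the
    // executor's local disk instead of dropping them, so Job B never goes back to HDFS.
    import org.apache.spark.storage.StorageLevel

    val logData = sc.textFile(logFile, 2).persist(StorageLevel.MEMORY_AND_DISK)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()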



Thanks,
Prabhu Joseph