If spark.locality.wait is 0, there are two performance issues:

   1. The task scheduler does not wait for a data-local slot; it launches each task immediately on whatever node is free, even if that node is less local. More tasks end up running at a less-local level, which hurts overall job performance.
   2. When an executor does not have enough heap memory, so that some tasks read their RDD partition from the cache and others read it from Hadoop, all of the tasks start in parallel. Because the executor process is then both memory- and I/O-intensive, GC overhead is high and the tasks run slower.
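
As a rough illustration of the alternative to 0, the wait can be lowered to a small non-zero value when the application is configured. This is only a sketch: the object name, application name and the "1s" value are assumptions chosen for illustration, and the time-string form assumes a Spark version that accepts it (older releases take milliseconds).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example application; the master URL is expected to come from spark-submit.
object LocalityWaitExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("locality-wait-example") // hypothetical application name
      // A short, non-zero wait instead of 0. "1s" is an illustrative value, not a recommendation.
      .set("spark.locality.wait", "1s")
    val sc = new SparkContext(conf)
    // Print the effective setting so it can be checked in the driver log.
    println(sc.getConf.get("spark.locality.wait"))
    sc.stop()
  }
}
```

The right value depends on how long the tasks run: a wait much longer than a typical task mostly adds idle time, while 0 gives up on locality entirely.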

On Thu, Feb 4, 2016 at 5:13 PM, Alonso Isidoro Roman <alonsoir@gmail.com> wrote:
"But learned that it is better not to reduce it to 0."

Could you explain this sentence a bit more?

Thanks

Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-02-04 11:33 GMT+01:00 Prabhu Joseph <prabhujose.gates@gmail.com>:
Okay, the reason for the task delay within the executor: when some RDD partitions are in memory and some are in Hadoop, the task set has multiple locality levels (NODE_LOCAL and ANY). In this case the scheduler waits for spark.locality.wait (3 seconds by default) to launch a data-local task before giving up and launching it on a less-local node.

So after setting it to 0, all tasks started in parallel. But I learned that it is better not to reduce it to 0.
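
The waiting behaviour described above can be sketched as a small model in plain Scala. This is a simplified illustration, not Spark's scheduler code: the object and function names are hypothetical, the same wait is assumed for every level, and the reset that happens when a task is actually launched is ignored.

```scala
// Simplified model of stepping through locality levels (hypothetical names, not Spark's API).
object DelaySchedulingSketch {
  /** Returns the least-local level the scheduler would accept after `elapsedMs`
    * milliseconds without a launch, given the levels present in the task set
    * (ordered from most to least local) and a single wait `waitMs` per level. */
  def allowedLevel(levels: Vector[String], waitMs: Long, elapsedMs: Long): String = {
    require(levels.nonEmpty, "at least one locality level is needed")
    require(elapsedMs >= 0, "elapsed time cannot be negative")
    if (waitMs <= 0) {
      // With a wait of 0 the scheduler falls through to the least-local level at once.
      levels.last
    } else {
      // Each full wait period that passes unlocks the next, less-local level.
      val steps = math.min(elapsedMs / waitMs, (levels.length - 1).toLong).toInt
      levels(steps)
    }
  }
}
```

In this model, with the levels NODE_LOCAL and ANY and a 3000 ms wait, a task that can only run at ANY is not accepted until 3000 ms have passed, while a wait of 0 accepts it immediately.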


On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.gates@gmail.com> wrote:
Hi All,


This is a sample Spark application that reads a log file from Hadoop (1.2GB, split into 5 RDD partitions of approximately 250MB each) and runs two jobs: Job A counts the lines containing "a" and Job B counts the lines containing "b". The application was run multiple times, each time with a different executor memory setting and with cache() enabled or disabled. Job A's performance is the same in all runs, because it has to read the entire data set from disk the first time.

Spark Cluster - standalone mode with Spark Master, single worker node (12 cores, 16GB memory)

    // Read the log file with a minimum of 2 partitions.
    val logData = sc.textFile(logFile, 2)
    val numAs = logData.filter(line => line.contains("a")).count()  // Job A
    val numBs = logData.filter(line => line.contains("b")).count()  // Job B
   

Results for Job B (which has 5 tasks) are below:

   
Run 1: 1 executor with 2GB memory, 12 cores took 2 seconds [ran1 image]

    Since logData is not cached, Job B has to read the 1.2GB of data from Hadoop into memory again. All 5 tasks started in parallel, each took 2 seconds (29 ms of GC), and the overall job completed in 2 seconds.
   
Run 2: 1 executor with 2GB memory, 12 cores and logData is cached took 4 seconds [ran2 image, ran2_cache image]

     val logData = sc.textFile(logFile, 2).cache()
     
     The executor does not have enough memory to cache the data and hence has to read the entire 1.2GB from Hadoop into memory again. But because cache() is used, there are many GC pauses, which slows task completion. All tasks started in parallel and each completed in 4 seconds (more than 1 second of GC).

Run 3: 1 executor with 6GB memory, 12 cores and logData is cached took 10 seconds [ran3 image]

     The executor has enough memory to cache 4 RDD partitions, but the 5th partition has to be read from Hadoop. 4 tasks started in parallel and completed in 0.3 seconds without GC. But the 5th task, which has to read its partition from disk, started after 4 seconds and completed in 2 seconds. I am analysing why the 5th task is not started in parallel with the other tasks, or at least immediately after the other tasks complete.
     
Run 4: 1 executor with 16GB memory , 12 cores and logData is cached took 0.3 seconds [ran4 image]

     The executor has enough memory to cache all 5 RDD partitions. All 5 tasks started in parallel and completed within 0.3 seconds.
     
     
So Spark performs well when the input data is either entirely in memory or not cached at all. When some RDD partitions are in memory and some are read from disk, there is a delay in scheduling the fifth task. Is this expected behavior or a possible bug?



Thanks,
Prabhu Joseph