spark-user mailing list archives

From Prabhu Joseph
Subject Spark job does not perform well when some RDD in memory and some on Disk
Date Mon, 01 Feb 2016 08:32:44 GMT
Hi All,

I have a sample Spark application that reads a log file from Hadoop (1.2GB,
split into 5 RDD partitions of approximately 250MB each) and runs two jobs.
Job A counts the lines containing "a" and Job B counts the lines containing
"b". I ran the application multiple times, each time with a different
executor memory setting and with cache() enabled or disabled. Job A performs
the same in all runs, since it has to read the entire data from disk the
first time.

Spark Cluster - standalone mode with Spark Master, single worker node (12
cores, 16GB memory)

    val logData = sc.textFile(logFile, 2)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
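
For reference, the snippet above can be expanded into a self-contained
application along the following lines. This is only a sketch: the object
name LogFilterApp, the timed helper, and the command-line arguments (input
path and a cache flag) are illustrative assumptions, not part of the
original test code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogFilterApp {
  // Hypothetical helper: runs a block and returns its result together
  // with the elapsed wall-clock time in milliseconds.
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumed arguments: args(0) = input path, args(1) = "true" to cache.
    val logFile = args(0)
    val useCache = args.length > 1 && args(1).toBoolean

    val sc = new SparkContext(new SparkConf().setAppName("LogFilterApp"))
    try {
      val raw = sc.textFile(logFile, 2)
      val logData = if (useCache) raw.cache() else raw

      // Job A: always reads the input from storage on first use.
      val (numAs, msA) = timed(logData.filter(line => line.contains("a")).count())
      // Job B: can reuse cached partitions if cache() was enabled.
      val (numBs, msB) = timed(logData.filter(line => line.contains("b")).count())

      println(s"Job A: $numAs lines with a, took $msA ms")
      println(s"Job B: $numBs lines with b, took $msB ms")
    } finally {
      sc.stop()
    }
  }
}
```

Executor memory would then be varied per run at submit time, for example
through the spark.executor.memory setting.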

*Job B (which has 5 tasks) results below:*

*Run 1:* 1 executor with 2GB memory, 12 cores took 2 seconds [ran1 image]

Since logData is not cached, Job B has to read the 1.2GB of data from Hadoop
into memory again. All 5 tasks started in parallel, each took 2 seconds
(29ms of GC), and the overall job completed in 2 seconds.

*Run 2:* 1 executor with 2GB memory, 12 cores and logData is cached took 4
seconds [ran2 image, ran2_cache image]

     val logData = sc.textFile(logFile, 2).cache()

The executor does not have enough memory to cache the data and hence needs
to read the entire 1.2GB from Hadoop into memory again. But because cache()
is used, there are many GC pauses, which slow down task completion. All
tasks started in parallel and each completed in 4 seconds (more than 1
second of GC).

*Run 3:* 1 executor with 6GB memory, 12 cores and logData is cached took 10
seconds [ran3 image]

The executor has enough memory to hold 4 of the RDD partitions, but the 5th
partition has to be read from Hadoop. 4 tasks started in parallel and
completed in 0.3 seconds without GC. But the 5th task, which has to read its
partition from disk, started only after 4 seconds and completed in 2
seconds. I am analysing why the 5th task is not started in parallel with the
other tasks, or at least immediately after the other tasks complete.
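
One way to narrow this down, sketched here under assumptions rather than as
a confirmed explanation, is to print how many partitions are actually cached
after Job A and to rerun Job B with the scheduler's locality wait disabled.
spark.locality.wait (default 3s) controls how long the scheduler waits to
launch a task at a better locality level before falling back to a less local
one; whether it accounts for the delay seen here is untested. The object
name Run3Diagnostics and the describe helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Run3Diagnostics {
  // Hypothetical helper: formats the cache coverage of one RDD.
  def describe(cached: Int, total: Int): String =
    s"$cached of $total partitions cached"

  def main(args: Array[String]): Unit = {
    val logFile = args(0) // assumed argument: input path

    val conf = new SparkConf()
      .setAppName("Run3Diagnostics")
      // Assumption to test: do not wait for a more local slot.
      .set("spark.locality.wait", "0")
    val sc = new SparkContext(conf)
    try {
      val logData = sc.textFile(logFile, 2).cache()

      // Job A populates the cache as far as executor memory allows.
      logData.filter(line => line.contains("a")).count()

      // getRDDStorageInfo is a developer API; it reports per-RDD cache state.
      sc.getRDDStorageInfo.foreach { info =>
        println(s"RDD ${info.id}: " +
          describe(info.numCachedPartitions, info.numPartitions) +
          s", memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
      }

      // Job B: compare its task start times in the UI with the earlier run.
      val start = System.nanoTime()
      val numBs = logData.filter(line => line.contains("b")).count()
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      println(s"Job B: $numBs lines with b, took $elapsedMs ms")
    } finally {
      sc.stop()
    }
  }
}
```

If the 5th task still starts late with the wait set to 0, locality
scheduling would be ruled out as the cause.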

*Run 4:* 1 executor with 16GB memory, 12 cores and logData is cached took
0.3 seconds [ran4 image]

The executor has enough memory to cache all 5 partitions. All 5 tasks
started in parallel and completed within 0.3 seconds.

So Spark performs well when the input data is either entirely in memory or
not cached at all. When some partitions are in memory and some have to be
read from disk, there is a delay in scheduling the fifth task. Is this
expected behavior or a possible bug?

Prabhu Joseph
