Can someone share some ideas about how to tune the GC time?


Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500


I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I setup a standalone box, with 24 cores and 64G memory.

We have one SQL in mind to test. Here is the basically setup on this one box for the SQL we are trying to run:

1) Dataset 1, 6.6G AVRO file with snappy compression, which contains nest structure of 3 array of struct in AVRO
2) Dataset2, 5G AVRO file with snappy compression
3) Dataset3, 2.3M AVRO file with snappy compression.

The basic structure of the query is like this:

dataset1 lateral view outer explode(struct1) lateral view outer explode(struct2)
where xxxxx )
left outer join
select xxxx from dataset2 lateral view explode(xxx) where xxxx
on xxxx
left outer join
select xxx from dataset3 where xxxx)
on xxxxx

So overall what it does is 2 outer explode on dataset1, left outer join with explode of dataset2, then finally left outer join with dataset 3.

On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark 1.2.0.

Baseline, the above query can finish around 50 minutes in Hive 12, with 6 mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs.

This is a very expensive query running in our production, of course with much bigger data set, every day. Now I want to see how fast Spark can do for the same query.

I am using the following settings, based on my understanding of Spark, for a fair test between it and Hive:

--executor-memory 9g 
--total-executor-cores 9

I am trying to run the one executor with 9 cores and max 9G heap, to make Spark use almost same resource we gave to the MapReduce. 
Here is the result without any additional configuration changes, running under Spark 1.2.0, using HiveContext in Spark SQL, to run the exactly same query:

The Spark SQL generated 5 stage of tasks, shown below:
4   collect at SparkPlan.scala:84 +details      2015/02/20 10:48:46 26 s    200/200             
3   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:32:07 16 min  200/200                     1112.3 MB
2   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 9 min  40/40       4.7 GB          22.2 GB
1   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 1.9 min 50/50       6.2 GB          2.8 GB
0   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 6 s     2/2         2.3 MB          156.6 KB

So the wall time of whole query is 26s + 16m + 9m + 2m + 6s, around 28 minutes.

It is about 56% of originally time, not bad. But I want to know any tuning of Spark can make it even faster.

For stage 2 and 3, I observed that GC time is more and more expensive. Especially in stage 3, shown below:

For stage 3:
Metric      Min     25th percentile     Median      75th percentile     Max
Duration    20 s    30 s                35 s        39 s                2.4 min
GC Time     9 s     17 s                20 s        25 s                2.2 min
Shuffle Write   4.7 MB  4.9 MB          5.2 MB      6.1 MB              8.3 MB

So in median, the GC time took overall 20s/35s = 57% of time.

First change I made is to add the following line in the spark-default.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer

My assumption is that using kryoSerializer, instead of default java serialize, will lower the memory footprint, should lower the GC pressure during runtime. I know the I changed the correct spark-default.conf, because if I were add "spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" in the same file, I will see the GC usage in the stdout file. Of course, in this test, I didn't add that, as I want to only make one change a time.
The result is almost the same, as using standard java serialize. The wall time is still 28 minutes, and in stage 3, the GC still took around 50 to 60% of time, almost same result within "min, median to max" in stage 3, without any noticeable performance gain.

Next, based on my understanding, and for this test, I think the default is too high for this query, as there is no reason to reserve so much memory for caching data, Because we don't reuse any dataset in this one query. So I add this at the end of spark-shell command "--conf", as I want to just reserve half of the memory for caching data vs first time. Of course, this time, I rollback the first change of "KryoSerializer".

The result looks like almost the same. The whole query finished around 28s + 14m + 9.6m + 1.9m + 6s = 27 minutes.

It looks like that Spark is faster than Hive, but is there any steps I can make it even faster? Why using "KryoSerializer" makes no difference? If I want to use the same resource as now, anything I can do to speed it up more, especially lower the GC time?