My assumption was that using KryoSerializer instead of the default Java serializer would lower the memory footprint, and therefore reduce GC pressure at runtime. I know I edited the correct spark-defaults.conf, because when I added "spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" to the same file, I could see the GC activity in the stdout file. Of course, I did not add that option for this test, since I want to make only one change at a time.
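For reference, this is roughly what the change looked like in spark-defaults.conf (the GC-logging line is shown commented out, since it was not enabled during this run):

```properties
# spark-defaults.conf
spark.serializer    org.apache.spark.serializer.KryoSerializer

# Enabled only earlier, to verify this file was actually being picked up:
# spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```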
The result is almost the same as with the standard Java serializer. The wall time is still 28 minutes, and in stage 3 GC still takes around 50 to 60% of the time, with almost the same min/median/max task times in stage 3 and no noticeable performance gain.
Next, based on my understanding, I think the default spark.storage.memoryFraction is too high for this query: there is no reason to reserve so much memory for caching data, because this single query does not reuse any dataset. So I appended "--conf spark.storage.memoryFraction=0.3" to the spark-shell command, to reserve roughly half as much memory for caching as in the first run. Of course, this time I rolled back the first change (KryoSerializer).
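Concretely, the launch looked roughly like this (the other cluster-specific options I use, such as master URL and executor sizing, are omitted here since they were unchanged):

```shell
# KryoSerializer change rolled back in spark-defaults.conf first;
# 0.3 is half of the 0.6 default for spark.storage.memoryFraction
spark-shell --conf spark.storage.memoryFraction=0.3
```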
The result again looks almost the same. The whole query finished in about 28 s + 14 m + 9.6 m + 1.9 m + 6 s, roughly 26 minutes.
It looks like Spark is faster than Hive, but are there any steps I can take to make it even faster? Why does using KryoSerializer make no difference? If I keep the same resources as now, is there anything I can do to speed it up further, especially to lower the GC time?