spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <>
Subject Spark performance tuning
Date Fri, 20 Feb 2015 21:04:23 GMT
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I setup a standalone
box, with 24 cores and 64G memory.
We have one SQL in mind to test. Here is the basically setup on this one box for the SQL we
are trying to run:
1) Dataset 1, 6.6G AVRO file with snappy compression, which contains nest structure of 3 array
of struct in AVRO2) Dataset2, 5G AVRO file with snappy compression3) Dataset3, 2.3M AVRO file
with snappy compression.
The basic structure of the query is like this:

(selectxxxfromdataset1 lateral view outer explode(struct1) lateral view outer explode(struct2)where
xxxxx )left outer join(select xxxx from dataset2 lateral view explode(xxx) where xxxx)on xxxxleft
outer join(select xxx from dataset3 where xxxx)on xxxxx
So overall what it does is 2 outer explode on dataset1, left outer join with explode of dataset2,
then finally left outer join with dataset 3.
On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark 1.2.0.
Baseline, the above query can finish around 50 minutes in Hive 12, with 6 mappers and 3 reducers,
each with 1G max heap, in 3 rounds of MR jobs.
This is a very expensive query running in our production, of course with much bigger data
set, every day. Now I want to see how fast Spark can do for the same query.
I am using the following settings, based on my understanding of Spark, for a fair test between
it and Hive:
export SPARK_WORKER_MEMORY=32gexport SPARK_DRIVER_MEMORY=2g--executor-memory 9g --total-executor-cores
I am trying to run the one executor with 9 cores and max 9G heap, to make Spark use almost
same resource we gave to the MapReduce. Here is the result without any additional configuration
changes, running under Spark 1.2.0, using HiveContext in Spark SQL, to run the exactly same
The Spark SQL generated 5 stage of tasks, shown below:4   collect at SparkPlan.scala:84 +details
     2015/02/20 10:48:46 26 s    200/200             3   mapPartitions at Exchange.scala:64
+details 2015/02/20 10:32:07 16 min  200/200                     1112.3 MB2   mapPartitions
at Exchange.scala:64 +details 2015/02/20 10:22:06 9 min  40/40       4.7 GB          22.2
GB1   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 1.9 min 50/50      
6.2 GB          2.8 GB0   mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06
6 s     2/2         2.3 MB          156.6 KB
So the wall time of whole query is 26s + 16m + 9m + 2m + 6s, around 28 minutes.
It is about 56% of originally time, not bad. But I want to know any tuning of Spark can make
it even faster.
For stage 2 and 3, I observed that GC time is more and more expensive. Especially in stage
3, shown below:
For stage 3:Metric      Min     25th percentile     Median      75th percentile     MaxDuration
   20 s    30 s                35 s        39 s                2.4 minGC Time     9 s    
17 s                20 s        25 s                2.2 minShuffle Write   4.7 MB  4.9 MB
         5.2 MB      6.1 MB              8.3 MB
So in median, the GC time took overall 20s/35s = 57% of time.
First change I made is to add the following line in the spark-default.conf:spark.serializer
My assumption is that using kryoSerializer, instead of default java serialize, will lower
the memory footprint, should lower the GC pressure during runtime. I know the I changed the
correct spark-default.conf, because if I were add "spark.executor.extraJavaOptions -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" in the same file, I will see the GC usage in the
stdout file. Of course, in this test, I didn't add that, as I want to only make one change
a time.The result is almost the same, as using standard java serialize. The wall time is still
28 minutes, and in stage 3, the GC still took around 50 to 60% of time, almost same result
within "min, median to max" in stage 3, without any noticeable performance gain.
Next, based on my understanding, and for this test, I think the default
is too high for this query, as there is no reason to reserve so much memory for caching data,
Because we don't reuse any dataset in this one query. So I add this at the end of spark-shell
command "--conf", as I want to just reserve half of the memory
for caching data vs first time. Of course, this time, I rollback the first change of "KryoSerializer".
The result looks like almost the same. The whole query finished around 28s + 14m + 9.6m +
1.9m + 6s = 27 minutes.
It looks like that Spark is faster than Hive, but is there any steps I can make it even faster?
Why using "KryoSerializer" makes no difference? If I want to use the same resource as now,
anything I can do to speed it up more, especially lower the GC time?
View raw message