spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Spark performance tuning
Date Sun, 22 Feb 2015 15:56:21 GMT
You can simply follow the tuning guide: http://spark.apache.org/docs/1.2.0/tuning.html

Thanks
Best Regards

On Sun, Feb 22, 2015 at 1:14 AM, java8964 <java8964@hotmail.com> wrote:

> Can someone share some ideas about how to tune the GC time?
>
> Thanks
>
> ------------------------------
> From: java8964@hotmail.com
> To: user@spark.apache.org
> Subject: Spark performance tuning
> Date: Fri, 20 Feb 2015 16:04:23 -0500
>
>
> Hi,
>
> I am new to Spark, and I am trying to test Spark SQL performance vs
> Hive. I set up a standalone box with 24 cores and 64G memory.
>
> We have one SQL query in mind to test. Here is the basic setup on this one
> box for the query we are trying to run:
>
> 1) Dataset 1, 6.6G AVRO file with snappy compression, which contains a nested
> structure of 3 arrays of structs in AVRO
> 2) Dataset 2, 5G AVRO file with snappy compression
> 3) Dataset 3, 2.3M AVRO file with snappy compression.
>
> The basic structure of the query is like this:
>
>
> (select
>    xxx
>  from
>    dataset1 lateral view outer explode(struct1) lateral view outer explode(struct2)
>  where xxxxx)
> left outer join
> (select xxxx from dataset2 lateral view explode(xxx) where xxxx)
> on xxxx
> left outer join
> (select xxx from dataset3 where xxxx)
> on xxxxx
>
> So overall it does 2 outer explodes on dataset1, a left outer join with
> the explode of dataset2, and finally a left outer join with dataset3.
>
> On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark
> 1.2.0.
>
> As a baseline, the above query finishes in around 50 minutes in Hive 0.12,
> with 6 mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs.
>
> This is a very expensive query that runs in our production every day, of
> course with a much bigger data set. Now I want to see how fast Spark can
> run the same query.
>
> I am using the following settings, based on my understanding of Spark, for
> a fair test between it and Hive:
>
> export SPARK_WORKER_MEMORY=32g
> export SPARK_DRIVER_MEMORY=2g
> --executor-memory 9g
> --total-executor-cores 9
>
> I am trying to run one executor with 9 cores and a max 9G heap, to make
> Spark use almost the same resources we gave to MapReduce.
> Here is the result without any additional configuration changes, running
> under Spark 1.2.0, using HiveContext in Spark SQL to run exactly the same
> query:
>
> Spark SQL generated 5 stages, shown below:
>
> Stage  Description                         Submitted            Duration  Tasks
> 4      collect at SparkPlan.scala:84       2015/02/20 10:48:46  26 s      200/200
> 3      mapPartitions at Exchange.scala:64  2015/02/20 10:32:07  16 min    200/200            1112.3 MB
> 2      mapPartitions at Exchange.scala:64  2015/02/20 10:22:06  9 min     40/40    4.7 GB    22.2 GB
> 1      mapPartitions at Exchange.scala:64  2015/02/20 10:22:06  1.9 min   50/50    6.2 GB    2.8 GB
> 0      mapPartitions at Exchange.scala:64  2015/02/20 10:22:06  6 s       2/2      2.3 MB    156.6 KB
>
> So the wall time of the whole query is 26s + 16m + 9m + 2m + 6s, around 28
> minutes.
>
> That is about 56% of the original time, which is not bad. But I want to know
> whether any Spark tuning can make it even faster.
>
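
As a rough cross-check of the arithmetic above, here is a minimal Python sketch that sums the per-stage durations from the stage list and also derives the elapsed time from the Submitted timestamps. It assumes the final stage ends 26 s after its submit time and that nothing else ran afterwards; all variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Stage durations in seconds, as reported in the stage list (stage id -> seconds).
stage_seconds = {0: 6, 1: 1.9 * 60, 2: 9 * 60, 3: 16 * 60, 4: 26}
summed_minutes = sum(stage_seconds.values()) / 60

# Elapsed time from the first stage's submit time to the (assumed) end of stage 4.
fmt = "%Y/%m/%d %H:%M:%S"
first_submit = datetime.strptime("2015/02/20 10:22:06", fmt)
last_submit = datetime.strptime("2015/02/20 10:48:46", fmt)
elapsed = (last_submit + timedelta(seconds=stage_seconds[4])) - first_submit
elapsed_minutes = elapsed.total_seconds() / 60

# Compare against the reported Hive baseline of about 50 minutes.
hive_minutes = 50
ratio_to_hive = summed_minutes / hive_minutes

print(f"summed: {summed_minutes:.1f} min, elapsed: {elapsed_minutes:.1f} min")
print(f"share of Hive baseline: {ratio_to_hive:.0%}")
```

With these inputs the summed durations come to about 27.4 minutes and the timestamp-based elapsed time to about 27.1 minutes, so either way the run takes roughly 55% of the Hive baseline.
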
> For stages 2 and 3, I observed that GC time becomes more and more expensive,
> especially in stage 3, shown below:
>
> For stage 3:
> Metric         Min     25th percentile  Median  75th percentile  Max
> Duration       20 s    30 s             35 s    39 s             2.4 min
> GC Time        9 s     17 s             20 s    25 s             2.2 min
> Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB
>
> So at the median, GC took 20s/35s = 57% of the task time.
>
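
To see how the GC share moves across the distribution, the same ratio can be computed for every column of the stage 3 summary. This is only a sketch: the summary columns are computed per metric, so the task with the minimum duration is not necessarily the task with the minimum GC time, and the ratios are a rough indication rather than per-task measurements. The names used here are illustrative.

```python
# (duration_seconds, gc_seconds) for each column of the stage 3 summary.
stage3 = {
    "min": (20, 9),
    "25th": (30, 17),
    "median": (35, 20),
    "75th": (39, 25),
    "max": (2.4 * 60, 2.2 * 60),
}


def gc_share(duration_s, gc_s):
    """Fraction of the task duration spent in GC."""
    return gc_s / duration_s


shares = {name: gc_share(d, g) for name, (d, g) in stage3.items()}

for name, share in shares.items():
    print(f"{name:>6}: {share:.0%}")
```

The shares come out at roughly 45%, 57%, 57%, 64% and 92%, so the slowest tasks appear to spend almost all of their time in GC.
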
> The first change I made was to add the following line to spark-defaults.conf:
> spark.serializer org.apache.spark.serializer.KryoSerializer
>
> My assumption is that using KryoSerializer, instead of the default Java
> serializer, will lower the memory footprint and so should lower the GC
> pressure at runtime. I know that I changed the correct spark-defaults.conf,
> because if I add "spark.executor.extraJavaOptions -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" to the same file, I see
> the GC usage in the stdout file. Of course, in this test I didn't add
> that, as I want to make only one change at a time.
> The result is almost the same as with the standard Java serializer. The wall
> time is still 28 minutes, and in stage 3 GC still took around 50 to
> 60% of the time, with almost the same min, median and max figures in stage 3
> and no noticeable performance gain.
>
> Next, based on my understanding, I think the default
> spark.storage.memoryFraction is too high for this query, as there is no
> reason to reserve so much memory for caching data when we don't reuse any
> dataset in this one query. So I added "--conf
> spark.storage.memoryFraction=0.3" at the end of the spark-shell command, to
> reserve only half as much memory for caching data as in the first run. Of
> course, this time I rolled back the first change ("KryoSerializer").
>
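
For a sense of scale, here is a small Python sketch of how much of the 9G executor heap the cache reservation covers at each setting. It assumes the Spark 1.2 default of 0.6 for spark.storage.memoryFraction (consistent with 0.3 being half of it) and a storage safety fraction of 0.9, which is an assumption about Spark internals rather than something stated in this thread. It also treats the full 9G as usable heap, although the JVM normally reports slightly less.

```python
def storage_gb(heap_gb, memory_fraction, safety_fraction=0.9):
    """Approximate heap reserved for cached blocks, in GB.

    safety_fraction=0.9 is an assumed internal default, not taken from the thread.
    """
    return heap_gb * memory_fraction * safety_fraction


heap_gb = 9.0  # --executor-memory 9g

default_cache_gb = storage_gb(heap_gb, 0.6)  # assumed default setting
reduced_cache_gb = storage_gb(heap_gb, 0.3)  # setting used in the second run
freed_gb = default_cache_gb - reduced_cache_gb

print(f"default: {default_cache_gb:.2f} GB, reduced: {reduced_cache_gb:.2f} GB")
print(f"heap no longer reserved for caching: {freed_gb:.2f} GB")
```

Under these assumptions the change moves roughly 2.4 GB out of the cache reservation, from about 4.9 GB to about 2.4 GB.
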
> The result looks almost the same. The whole query finished in around 28s
> + 14m + 9.6m + 1.9m + 6s, about 26 minutes.
>
> It looks like Spark is faster than Hive, but are there any steps I can
> take to make it even faster? Why does using "KryoSerializer" make no
> difference? With the same resources as now, is there anything I can do to
> speed it up further, especially to lower the GC time?
>
> Thanks
>
> Yong
>
>
