spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries
Date Wed, 26 Aug 2020 07:48:16 GMT
Hi,

Are you using s3a, that is, not going through EMRFS? If so, these results
do not make sense to me.
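
To make the question concrete: outside EMR, Spark normally reaches an S3-compatible store through the Hadoop S3A connector. Below is a minimal sketch of the properties that would route `s3a://` paths to a custom endpoint such as a Ceph gateway. The property names are standard Hadoop S3A settings; the helper names and the endpoint URL are placeholders of mine, not values from your attached configuration.

```python
def s3a_conf(endpoint, path_style=True, ssl=False):
    """Build Spark properties that point the S3A connector at a custom endpoint.

    `endpoint` is a placeholder for the object store URL (hypothetical here).
    """
    return {
        # Use the Hadoop S3A filesystem for s3a:// URIs.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Non-AWS stores usually need path-style bucket addressing.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(ssl).lower(),
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(" ".join(to_submit_args(s3a_conf("http://ceph-rgw.example:8080"))))
```

If your configuration differs from this shape (for example, a different filesystem implementation behind the `s3://` scheme), that would be worth mentioning.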

Regards,
Gourav Sengupta

On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <
abhishek.rao@nokia.com> wrote:

> Hi All,
>
>
>
> We’re comparing the performance of Spark querying data on HDFS against
> Spark querying data on S3 (with Ceph Object Store providing the S3 storage)
> using the standard TPC-DS queries. We observe that Spark 3.0 with S3 takes
> significantly longer than Spark 3.0 with HDFS for some of the queries.
>
> We also ran similar queries with Spark 2.4.5 against data on S3, and for
> this set of queries Spark 2.4.5 takes less time than Spark 3.0, which
> looks very strange.
>
> Below are the details of the 9 queries where Spark 3.0 takes more than 5
> times as long on S3 as on Hadoop (HDFS).
>
>
>
> *Environment Details:*
>
>    - *Spark running on Kubernetes*
>    - *TPC DS Scale Factor*: *500 GB*
>    - *Hadoop 3.x*
>    - *Same CPU and memory used for all executions*
>
>
>
> | Query | Spark 3.0 with S3 (Time in seconds) | Spark 3.0 with Hadoop (Time in seconds) | Spark 2.4.5 with S3 (Time in seconds) | Spark 3.0 HDFS vs S3 (Factor) | Spark 2.4.5 S3 vs Spark 3.0 S3 (Factor) | Table involved |
> |-------|---------|---------|---------|----------|----------|-------------|
> | 9     | 880.129 | 106.109 | 147.65  | 8.294574 | 5.960914 | store_sales |
> | 44    | 129.618 | 23.747  | 103.916 | 5.458289 | 1.247334 | store_sales |
> | 58    | 142.113 | 20.996  | 33.936  | 6.768575 | 4.187677 | store_sales |
> | 62    | 32.519  | 5.425   | 14.809  | 5.994286 | 2.195894 | web_sales   |
> | 76    | 138.765 | 20.73   | 49.892  | 6.693922 | 2.781308 | store_sales |
> | 88    | 475.824 | 48.2    | 94.382  | 9.871867 | 5.04147  | store_sales |
> | 90    | 53.896  | 6.804   | 18.11   | 7.921223 | 2.976035 | web_sales   |
> | 94    | 241.172 | 43.49   | 81.181  | 5.545459 | 2.970794 | web_sales   |
> | 96    | 67.059  | 10.396  | 15.993  | 6.450462 | 4.193022 | store_sales |
>
>
>
> On further analysis, we see that all of these queries operate on either
> the store_sales or the web_sales table, and that Spark 3.0 with S3 seems
> to download much more data from storage than Spark 3.0 with Hadoop or
> Spark 2.4.5 with S3, which results in longer query completion times. I’m
> attaching screenshots of the Driver UI for one such instance (Query 9)
> for reference.
>
> Also attached are the Spark configurations (Spark 3.0) used for these tests.
>
>
>
> We’re not sure why Spark 3.0 on S3 behaves this way. Any input on what
> we’re missing?
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
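
The two factor columns in the quoted table can be recomputed from the raw timings, which is a quick way to confirm which ratio each column reports. A minimal sketch, with the timings copied from the table above; the function and variable names are mine, not from the original message.

```python
# Timings in seconds, copied from the quoted table:
# query -> (Spark 3.0 on S3, Spark 3.0 on HDFS, Spark 2.4.5 on S3)
TIMINGS = {
    9: (880.129, 106.109, 147.65),
    44: (129.618, 23.747, 103.916),
    58: (142.113, 20.996, 33.936),
    62: (32.519, 5.425, 14.809),
    76: (138.765, 20.73, 49.892),
    88: (475.824, 48.2, 94.382),
    90: (53.896, 6.804, 18.11),
    94: (241.172, 43.49, 81.181),
    96: (67.059, 10.396, 15.993),
}


def slowdown_factors(timings):
    """Return {query: (S3 vs HDFS on Spark 3.0, Spark 3.0 vs 2.4.5 on S3)}."""
    return {
        query: (s3_30 / hdfs_30, s3_30 / s3_245)
        for query, (s3_30, hdfs_30, s3_245) in timings.items()
    }


if __name__ == "__main__":
    for query, (vs_hdfs, vs_245) in slowdown_factors(TIMINGS).items():
        print(f"q{query}: {vs_hdfs:.2f}x vs HDFS, {vs_245:.2f}x vs Spark 2.4.5 on S3")
```

Both factors use the Spark 3.0 on S3 time as the numerator, so a value above 1 means Spark 3.0 on S3 was the slower run.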
