spark-user mailing list archives

From "Lalwani, Jayesh" <Jayesh.Lalw...@capitalone.com>
Subject Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
Date Thu, 29 Mar 2018 14:44:02 GMT
Without knowing more details, I can only guess. It could be that Spark is creating a lot
of tasks even though there are few records. Creating and distributing tasks has a noticeable
fixed overhead, which dominates the run time on smaller datasets.

You might want to look at the driver logs, or the Spark Application Detail UI.
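
To make the guess above concrete, here is a toy cost model in Python. It is only a sketch: the function name scan_time, the startup cost, the per-task overhead, the per-record cost, and the task counts are all made-up assumptions rather than measurements of Spark. The one value taken from the thread is the 12 workers, treated here as 12 task slots.

```python
import math

def scan_time(num_records, num_tasks, slots=12,
              startup_s=1.0, task_overhead_s=0.05, record_s=1e-7):
    """Split a rough scan-time estimate into fixed overhead and real work.

    All constants are hypothetical and chosen only for illustration.
    """
    # Tasks run in waves when there are more tasks than slots.
    waves = math.ceil(num_tasks / slots)
    # Fixed cost: job startup plus scheduling overhead for each wave of tasks.
    fixed = startup_s + waves * task_overhead_s
    # Useful work: records handled per task, once per wave.
    work = waves * (num_records / num_tasks) * record_s
    return fixed, work

# Hypothetical cases: a table with a couple dozen rows split into many tasks,
# and a table with billions of rows.
for label, records, tasks in [("tiny table", 24, 200),
                              ("large table", 2_000_000_000, 2000)]:
    fixed, work = scan_time(records, tasks)
    share = fixed / (fixed + work)
    print(f"{label}: fixed={fixed:.2f}s work={work:.2f}s overhead share={share:.0%}")
```

In this model the tiny table's run time is almost entirely fixed overhead, while for the large table the scan work is the larger term. The task count shown in the Spark UI for the small-table query would tell you whether this is what is actually happening.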

From: Tin Vu <tvu032@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to
Drill or Presto

Hi,

I am running a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My
experimental setup:
· TPCDS dataset with scale factor 100 (size 100GB).
· Spark, Drill and Presto have the same number of workers: 12.
· Each worker has the same amount of allocated memory: 4GB.
· Data is stored in Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"
The issue is that for some small tables (even tables with a few dozen records), SparkSQL
still required about 7-8 seconds to finish, while Drill and Presto needed less than 1
second.
For other, large tables with billions of records, SparkSQL performance was reasonable: it
required 20-30 seconds to scan the whole table.
Do you have any idea or reasonable explanation for this issue?

Thanks,
