I am running a benchmark to compare the performance of SparkSQL, Apache Drill, and Presto. My experimental setup:
  • TPC-DS dataset with scale factor 100 (about 100 GB).
  • Spark, Drill, and Presto each have the same number of workers: 12.
  • Each worker has the same amount of allocated memory: 4 GB.
  • The data is stored in Hive tables in ORC format.

I executed a very simple SQL query: "SELECT * FROM table_name"
The issue is that for some small tables (even tables with only a few dozen records), SparkSQL still needed about 7-8 seconds to finish, while Drill and Presto each needed less than 1 second.
For other, large tables with billions of records, SparkSQL's performance was reasonable: it needed 20-30 seconds to scan the whole table.
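
One way to check whether the 7-8 seconds is a fixed per-query overhead rather than scan time is to time the same query several times and compare the first (cold) run with the later (warm) runs. The following is only a minimal sketch under assumptions: `time_query` is a hypothetical helper, not part of any of the three engines, and the commented PySpark usage assumes an existing SparkSession named `spark` with Hive support and a placeholder table name.

```python
import time
from statistics import median


def time_query(run_query, repeats=5):
    """Run `run_query` `repeats` times.

    Returns (first_run_seconds, median_of_later_runs_seconds).
    If there is only one run, both values are that run's time.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # any zero-argument callable that executes the query
        timings.append(time.perf_counter() - start)
    first, rest = timings[0], timings[1:]
    return first, (median(rest) if rest else first)


# Hypothetical usage with PySpark (assumes `spark` already exists):
# first, warm = time_query(
#     lambda: spark.sql("SELECT * FROM table_name").collect()
# )
# print(f"first run: {first:.2f}s, warm median: {warm:.2f}s")
```

If the warm runs on a small table stay close to the first run, the cost is probably paid on every query (for example planning or task scheduling) rather than once at startup.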
Do you have any ideas or a reasonable explanation for this behavior?