spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lantao Jin (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-30114) Optimize LIMIT only query by partial listing files
Date Wed, 04 Dec 2019 01:45:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Lantao Jin updated SPARK-30114:
-------------------------------
    Description: 
We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution
time would be very big since it has to list all files to build a RDD before execution. But
almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files.
This optimization will create a *SinglePartitionReadRDD* to address it.

In our production result, this optimization benefits a lot. The duration time of simple query
with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table
which has about one hundred thousands files would run over 30 seconds, after applying this
optimization, the time decreased to 5 seconds.


Should support both Spark DataSource Table and Hive Table which can be converted to DataSource
table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.

  was:
We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation. When
we execute some queries like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution
time would be very big since it has to list all files to build a RDD before execution. But
almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files.
This optimization will create a *SinglePartitionReadRDD* to address it.

In our production result, this optimization benefits a lot. The duration time of simple query
with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table
which has about one hundred thousands files would run over 30 seconds, after applying this
optimization, the time decreased to 5 seconds.


Should support both Spark DataSource Table and Hive Table which can be converted to DataSource
table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.


> Optimize LIMIT only query by partial listing files
> --------------------------------------------------
>
>                 Key: SPARK-30114
>                 URL: https://issues.apache.org/jira/browse/SPARK-30114
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation
like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the
execution time would be very big since it has to list all files to build a RDD before execution.
But almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all
files. This optimization will create a *SinglePartitionReadRDD* to address it.
> In our production result, this optimization benefits a lot. The duration time of simple
query with LIMIT could reduce 5~10 times. For example, before this optimization, a query on
a table which has about one hundred thousands files would run over 30 seconds, after applying
this optimization, the time decreased to 5 seconds.
> Should support both Spark DataSource Table and Hive Table which can be converted to DataSource
table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message