spark-issues mailing list archives

From "Lantao Jin (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-30114) Improve LIMIT only query by partial listing files
Date Wed, 04 Dec 2019 01:47:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lantao Jin updated SPARK-30114:
-------------------------------
    Summary: Improve LIMIT only query by partial listing files  (was: Optimize LIMIT only query by partial listing files)

> Improve LIMIT only query by partial listing files
> -------------------------------------------------
>
>                 Key: SPARK-30114
>                 URL: https://issues.apache.org/jira/browse/SPARK-30114
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> We use Spark as an ad-hoc query engine. Most of our users' SELECT queries include a LIMIT operation, for example:
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TABLE TAB_B AS SELECT * FROM TABLE_A LIMIT N
> If TABLE_A is a large table (an RDD with many thousands of partitions), the execution time is very long because Spark has to list all files to build the RDD before execution. But most of the time N is small, such as 10, 100, or 1000, so there is no need to scan all files. This optimization creates a *SinglePartitionReadRDD* to address this.
> In our production environment, this optimization helps a lot: the duration of a simple query with LIMIT can drop by a factor of 5 to 10. For example, a query on a table with about one hundred thousand files ran for over 30 seconds before this optimization and 5 seconds after it.
> It should support both Spark DataSource tables and Hive tables that can be converted to DataSource tables.
> It should support bucketed tables, partitioned tables, and normal tables.
> It should support different file formats, such as Parquet and ORC.
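
The partial-listing idea in the description above can be illustrated outside Spark. The sketch below is not the SPARK-30114 implementation and uses no Spark APIs. The names iter_files and limit_rows are hypothetical, and plain text files with one row per line stand in for Parquet or ORC files. It only shows the principle: list files lazily and stop as soon as N rows have been collected, so a small LIMIT touches only a few files.

```python
import os
from typing import Iterator, List, Tuple


def iter_files(root: str) -> Iterator[str]:
    """Lazily yield file paths under root (hypothetical helper).

    os.scandir returns a lazy iterator, so no directory is listed
    before the caller actually asks for more files.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def limit_rows(root: str, n: int) -> Tuple[List[str], int]:
    """Return up to n rows plus the number of files that were opened.

    Assumption: each file is text with one row per line. A real engine
    would read Parquet or ORC and would also honor partitions and buckets.
    """
    rows: List[str] = []
    files_read = 0
    if n <= 0:
        return rows, files_read
    for path in iter_files(root):
        files_read += 1
        with open(path) as handle:
            for line in handle:
                rows.append(line.rstrip("\n"))
                if len(rows) == n:
                    # Enough rows collected: stop listing and reading.
                    return rows, files_read
    return rows, files_read
```

Calling limit_rows on a directory of many small files with a small n returns after opening only as many files as are needed to produce n rows. An eager listing would pay the cost of enumerating every file first.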



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


