spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
Date Sun, 30 Aug 2015 11:12:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721483#comment-14721483 ]

Apache Spark commented on SPARK-9926:
-------------------------------------

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8512

> Parallelize file listing for partitioned Hive table
> ---------------------------------------------------
>
>                 Key: SPARK-9926
>                 URL: https://issues.apache.org/jira/browse/SPARK-9926
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very slowly against
> partitioned Hive tables because of file listing. In particular, queries that scan a large
> number of partitions on storage like S3 run extremely slowly. Here are some example benchmarks
> from my environment:
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
> ||\# of partitions||\# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;|
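A rough back-of-the-envelope check on the benchmark figures above, in Python. It assumes the reported runtimes are rounded and dominated by file listing, which the report suggests but does not measure separately, so the rates are only approximate.

```python
# (partitions, files, runtime in seconds) as reported in the table above.
# "6 mins" and "1 hour" are rounded figures, so treat the rates as estimates.
benchmarks = [
    (1, 972, 30),
    (24, 13646, 6 * 60),
    (240, 136222, 60 * 60),
]

# Files handled per second of total query runtime, per benchmark row.
rates = [files / seconds for _, files, seconds in benchmarks]

for (partitions, files, seconds), rate in zip(benchmarks, rates):
    print("%d partitions, %d files: %.1f files/sec" % (partitions, files, rate))
```

The three rates come out at roughly 32, 38, and 38 files per second, so runtime grows about linearly with the number of files. That is what one would expect if the files are listed one after another rather than concurrently.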
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive partition
> path and groups them into a UnionRDD. All the input files are then listed sequentially. In
> other tools such as Hive and Pig, this can be solved by setting [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
> to a high value. In Spark, however, each HadoopRDD lists only one partition path, so setting
> this property doesn't help.
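
The general idea of listing several partition paths concurrently instead of one at a time can be sketched in Python. This is only an illustration of the technique, not the change made in the pull request and not Spark or Hadoop code: it uses a local temporary directory in place of S3, and the helper names (`list_partition`, `list_sequential`, `list_parallel`, `make_demo_table`) and the `num_threads` parameter are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_partition(path):
    """List the files directly under one partition directory."""
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )


def list_sequential(partition_paths):
    """List partitions one after another, as described in the issue."""
    files = []
    for path in partition_paths:
        files.extend(list_partition(path))
    return files


def list_parallel(partition_paths, num_threads=8):
    """List partitions concurrently with a thread pool (hypothetical helper).

    Executor.map yields results in input order, so the output matches
    list_sequential even though the listings overlap in time.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_partition = pool.map(list_partition, partition_paths)
        return [f for files in per_partition for f in files]


def make_demo_table(root, num_partitions=4, files_per_partition=3):
    """Create a tiny local directory tree shaped like a partitioned table."""
    paths = []
    for hour in range(num_partitions):
        part = os.path.join(root, "dateint=20150601", "hour=%d" % hour)
        os.makedirs(part)
        for i in range(files_per_partition):
            open(os.path.join(part, "part-%05d.parquet" % i), "w").close()
        paths.append(part)
    return paths


with tempfile.TemporaryDirectory() as root:
    partitions = make_demo_table(root)
    sequential = list_sequential(partitions)
    parallel = list_parallel(partitions)
    print(len(parallel), "files listed; same result:", sequential == parallel)
```

On a local disk the two approaches take about the same time. The benefit would be expected on high-latency storage, where each listing call spends most of its time waiting on the network.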



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

